COMPLEXITY MANAGEMENT AND ANALYSIS OF GENOMIC DNA 

RELATED APPLICATIONS 
This application claims the benefit of U.S. Provisional Application Serial Nos. 
60/105,867, filed 10/27/98, and 60/136,125, filed 5/26/99, the entire teachings of which 
are incorporated herein by reference. 

BACKGROUND OF THE INVENTION 
The past years have seen a dynamic change in the ability of science to 
comprehend vast amounts of data. Pioneering technologies such as nucleic acid arrays 
allow scientists to delve into the world of genetics in far greater detail than ever before. 
Exploration of genomic DNA has long been a dream of the scientific community. Held 
within the complex structures of genomic DNA lies the potential to identify, diagnose, 
or treat diseases like cancer, Alzheimer's or alcoholism. Answers to the world's food 
distribution problems may be held within the exploitation of genomic information from 
plants and animals. 

It is estimated that by the Spring of 2000 a reference sequence of the entire 
human genome will be sequenced allowing for types of genetic analysis that were never 
before possible. Novel methods of sample preparation and sample analysis are needed 
to provide for the fast and cost effective exploration of complex samples of nucleic 
acids, particularly genomic DNA. 

SUMMARY OF THE INVENTION 
The present invention provides a flexible and scalable method for analyzing 
complex samples of nucleic acids, such as genomic DNA. These methods are not 
limited to any particular type of nucleic acid sample: plant, bacterial, animal (including 
human) total genome DNA, RNA, cDNA and the like may be analyzed using some or 
all of the methods disclosed in this invention. The word "DNA" may be used below as 
an example of a nucleic acid. It is understood that this term includes all nucleic acids, 
such as DNA and RNA, unless a use below requires a specific type of nucleic acid. 
This invention provides a powerful tool for analysis of complex nucleic acid samples. 
From experimental design to isolation of desired fragments and hybridization to an 
appropriate array, the invention provides for faster, more efficient and less expensive 
methods of complex nucleic acid analysis. 

The present invention provides for novel methods of sample preparation and 
analysis comprising managing or reducing, in a reproducible manner, the complexity of 
a nucleic acid sample. The present invention eliminates the need for multiplex PCR, a 
time intensive and expensive step in most large scale analysis protocols, and for many 
of the embodiments the step of complexity reduction may be performed entirely in a 
single tube. The invention further provides for analysis of the sample by hybridization 
to an array which may be specifically designed to interrogate fragments for particular 
characteristics, such as, for example, the presence or absence of a polymorphism. The 
invention further provides for novel methods of using a computer system to model 
enzymatic reactions in order to determine experimental conditions and/or to design 
arrays. In a preferred embodiment the invention discloses novel methods of 
genome-wide polymorphism discovery and genotyping. 

In one embodiment of the invention, the step of complexity management of the 
nucleic acid sample comprises enzymatically cutting the nucleic sample into fragments, 
separating the fragments and selecting a particular fragment pool. Optionally, the 
selected fragments are then ligated to adaptor sequences containing PCR primer 
templates. 

In a preferred embodiment, the step of complexity management is performed 
entirely in a single tube. 

In one embodiment of complexity management, a Type IIs endonuclease is used 
to digest the nucleic acid sample and the fragments are selectively ligated to adaptor 
sequences and then amplified. 

In another embodiment, the method of complexity management utilizes two 
restriction enzymes with different cutting sites and frequencies and two different 
adaptor sequences. 

In another embodiment of the invention, the step of complexity management 
comprises performing the Arbitrarily Primed Polymerase Chain Reaction (AP PCR) 
upon the sample. 



WO 00/24939    PCT/US99/25200 



In another embodiment of the invention, the step of complexity management 
comprises removing repeated sequences by denaturing and reannealing the DNA and 
then removing double stranded duplexes. 

In another embodiment of the invention, the step of complexity management 
comprises hybridizing the DNA sample to a magnetic bead which is bound to an 
oligonucleotide probe containing a desired sequence. This embodiment may further 
comprise exposing the hybridized sample to a single strand DNA nuclease to remove 
the single stranded DNA, ligating an adaptor sequence containing a Class IIs 
restriction enzyme site to the resulting duplexed DNA and digesting the duplex with the 
appropriate Class IIs restriction enzyme to release the magnetic bead. This 
embodiment may or may not comprise amplification of the isolated DNA sequence. 
Furthermore, the adaptor sequence may or may not be used as a template for the PCR 
primer. In this embodiment, the adaptor sequence may or may not contain a SNP 
identification sequence or tag. 

In another embodiment, the method of complexity management comprises 
exposing the DNA sample to a mismatch binding protein and digesting the sample with 
a 3' to 5' exonuclease and then a single strand DNA nuclease. This embodiment may or 
may not include the use of a magnetic bead attached to the mismatch binding protein. 

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 is a schematic representation of a method of complexity management 
comprising restriction enzyme digest, fragment separation, and isolation and 
purification of a fragment size range of interest. 

Figure 2 is a schematic representation of a method of complexity management 
comprising restriction enzyme digest, fragment separation, isolation and purification of 
a fragment size range of interest, ligation of an adaptor sequence to the desired 
fragments and amplification of those fragments. 

Figure 3 depicts the effect on complexity of PCR amplification using primers 
with and without specific nucleotides. 

Figure 4 is a schematic representation of a method of complexity management 
comprising a Type IIs restriction enzyme digest, adaptor sequence ligation and 
amplification of desired fragments. 

Figure 5 depicts Type IIs restriction enzymes and their cleavage sites. 

Figure 6 is a schematic representation of a method of complexity management 
comprising a Type IIs restriction enzyme digest, adaptor sequence ligation and 
amplification of desired fragments. 

Figure 7 is a schematic representation of a method of complexity management 
comprising AP PCR. 

Figure 8 depicts the results of AP PCR on human genomic DNA. 

Figure 9 depicts the reproducibility of AP PCR. 

Figure 10 is a schematic representation of a method of complexity management 
comprising removing repetitive sequences by denaturing and reannealing genomic 
DNA. 

Figure 11 is a schematic representation of a method of complexity management 
comprising hybridizing a probe sequence attached to a magnetic bead to a pool of 
fractionated DNA. 

Figure 12 is a schematic representation of a method of complexity management 
comprising hybridizing a probe sequence bound to a magnetic bead to a pool of 
fractionated DNA, ligating an adaptor sequence containing a class IIs restriction 
enzyme site to the DNA/probe duplex, digesting the duplex, ligating a second adaptor 
sequence to the duplex and amplifying. 

Figure 13 is a schematic representation of a method of complexity management 
comprising hybridizing a probe sequence bound to a magnetic bead to a pool of 
fractionated DNA, ligating an adaptor sequence containing a class IIs restriction 
enzyme site to the DNA/probe duplex, digesting the duplex, ligating a second adaptor 
sequence to the duplex and amplifying. 

Figure 14 depicts a chimeric probe array. 

Figure 15 is a schematic representation of a method of complexity management 
comprising hybridizing a probe sequence attached to a magnetic bead to a pool of 
fractionated DNA, ligating an adaptor sequence containing a class IIs restriction 
enzyme site to the DNA/probe duplex, digesting the duplex, ligating a second adaptor 
sequence to the duplex, amplifying and hybridizing the amplicons to a chimeric probe 
array. 

Figure 16 is a schematic representation of a method of complexity management 
comprising hybridizing a mismatch binding protein to DNA containing a 
polymorphism and isolating the region containing the polymorphism. 

Figure 17 is a schematic representation of a method of complexity management 
comprising attaching a magnetic bead to the mismatch binding protein of Figure 16. 

Figure 18 shows digestion of DNA by a combination of restriction enzymes. 

Figure 19 shows digested yeast total genomic DNA. 

Exhibit 1 is an example of one type of computer program which can be written 
to model restriction enzyme digestions. 

Exhibit 2 is an example of one type of computer program which can be written 
to model ligation reactions. 

DETAILED DESCRIPTION OF THE PRESENT INVENTION 
This application relies on the disclosure of other patent applications and 
literature references. These documents are hereby incorporated by reference in their 
entireties for all purposes. 



Definitions 

A "genome" is all the genetic material in the chromosomes of an organism. 
DNA derived from the genetic material in the chromosomes of a particular organism is 
genomic DNA. A genomic library is a collection of clones made from a set of 
randomly generated overlapping DNA fragments representing the entire genome of an 
organism. 

An "oligonucleotide" can be nucleic acid, such as DNA or RNA, and single- or 
double-stranded. Oligonucleotides can be naturally occurring or synthetic, but are 
typically prepared by synthetic means. Oligonucleotides can be of any length but are 
usually at least 5, 10, or 20 bases long and may be up to 20, 50, 100, 1,000, or 5,000 
bases long. A polymorphic site can occur within any position of the oligonucleotide. 
Oligonucleotides can include peptide nucleic acids (PNAs) or analog nucleic acids. 
See US Patent Application No. 08/630,427 filed 4/3/96. 

An array comprises a solid support with nucleic acid probes attached to said 
support. Arrays typically comprise a plurality of different oligonucleotide probes that 
are coupled to a surface of a substrate in different known locations. These arrays, also 
described as "microarrays" or colloquially "chips", have been generally described in the 
art, for example, U.S. Pat. Nos. 5,143,854, 5,445,934, 5,744,305, 5,677,195 and PCT 
Patent Publication Nos. WO 90/15070 and 92/10092, each of which is incorporated by 
reference in its entirety for all purposes. These arrays may generally be produced using 
mechanical synthesis methods or light directed synthesis methods which incorporate a 
combination of photolithographic methods and solid phase synthesis methods. See 
Fodor et al., Science, 251:767-777 (1991), Pirrung et al., U.S. Pat. No. 5,143,854 (see 
also PCT Application No. WO 90/15070) and Fodor et al., PCT Publication No. WO 
92/10092 and U.S. Pat. No. 5,424,186, each of which is hereby incorporated in its 
entirety by reference for all purposes. Techniques for the synthesis of these arrays using 
mechanical synthesis methods are described in, e.g., U.S. Pat. No. 5,384,261, 
incorporated herein by reference in its entirety for all purposes. Although a planar 
array surface is preferred, the array may be fabricated on a surface of virtually any 
shape or even a multiplicity of surfaces. Arrays may be nucleic acids on beads, fibers 
such as fiber optics, glass or any other appropriate substrate; see US Patent Nos. 
5,770,358, 5,789,162, 5,708,153 and 5,800,992, which are hereby incorporated in their 
entirety for all purposes. Arrays may be packaged in such a manner as to allow for 
diagnostics or other manipulation in an all inclusive device; see, for example, US 
Patent Nos. 5,856,174 and 5,922,591, incorporated in their entirety by reference for all 
purposes. 

Hybridization probes are oligonucleotides capable of binding in a base-specific 
manner to a complementary strand of nucleic acid. Such probes include peptide nucleic 
acids, as described in Nielsen et al., Science 254, 1497-1500 (1991), and other nucleic 
acid analogs and nucleic acid mimetics. See US Patent Application No. 08/630,427 
filed 4/3/96. 

Hybridizations are usually performed under stringent conditions, for example, at 
a salt concentration of no more than 1 M and a temperature of at least 25°C. For 
example, conditions of 5X SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, 
pH 7.4) and a temperature of 25-30°C are suitable for allele-specific probe 
hybridizations. For stringent conditions, see, for example, Sambrook, Fritsch and 
Maniatis, "Molecular Cloning: A Laboratory Manual", 2nd Ed., Cold Spring Harbor Press 
(1989), which is hereby incorporated by reference in its entirety for all purposes. 

Polymorphism refers to the occurrence of two or more genetically determined 
alternative sequences or alleles in a population. A polymorphic marker or site is the 
locus at which divergence occurs. Preferred markers have at least two alleles, each 
occurring at a frequency of greater than 1%, and more preferably greater than 10% or 
20% of a selected population. A polymorphism may comprise one or more base 
changes, an insertion, a repeat, or a deletion. A polymorphic locus may be as small as 
one base pair. Polymorphic markers include restriction fragment length 
polymorphisms, variable number of tandem repeats (VNTRs), hypervariable regions, 
minisatellites, dinucleotide repeats, trinucleotide repeats, tetranucleotide repeats, simple 
sequence repeats, and insertion elements such as Alu. The first identified allelic form is 
arbitrarily designated as the reference form and other allelic forms are designated as 
alternative or variant alleles. The allelic form occurring most frequently in a selected 
population is sometimes referred to as the wildtype form. Diploid organisms may be 
homozygous or heterozygous for allelic forms. A diallelic polymorphism has two 
forms. A triallelic polymorphism has three forms. 
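
The frequency criterion for preferred markers can be illustrated with a short sketch. It is only an illustration of the definition above, not part of the disclosed method; the function names and the genotype encoding (two-letter strings such as "AG" for a diploid individual) are assumptions made for the example.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as "AG" (assumed encoding)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_preferred_marker(freqs, threshold=0.01):
    """True if at least two alleles each occur at a frequency greater than threshold."""
    return sum(1 for f in freqs.values() if f > threshold) >= 2


def most_common_allele(freqs):
    """The allelic form occurring most frequently (sometimes called the wildtype form)."""
    return max(freqs, key=freqs.get)


if __name__ == "__main__":
    population = ["AA", "AG", "AA", "GG", "AA"]
    freqs = allele_frequencies(population)
    print(freqs, is_preferred_marker(freqs, 0.10), most_common_allele(freqs))
```

Raising `threshold` to 0.10 or 0.20 applies the more preferred criteria named in the definition.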

A single nucleotide polymorphism (SNP) occurs at a polymorphic site occupied 
by a single nucleotide, which is the site of variation between allelic sequences. The site 
is usually preceded by and followed by highly conserved sequences of the allele (e.g., 
sequences that vary in less than 1/100 or 1/1000 members of the populations). 

A single nucleotide polymorphism usually arises due to substitution of one 
nucleotide for another at the polymorphic site. A transition is the replacement of one 
purine by another purine or one pyrimidine by another pyrimidine. A transversion is 
the replacement of a purine by a pyrimidine or vice versa. Single nucleotide 
polymorphisms can also arise from a deletion of a nucleotide or an insertion of a 
nucleotide relative to a reference allele. 
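
The distinction between a transition and a transversion can be expressed as a minimal sketch. The function name and the restriction to the four DNA bases are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("expected one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    # Same chemical class (purine to purine, or pyrimidine to pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(ref, alt, classify_substitution(ref, alt))
```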

An individual is not limited to a human being, but may also include other 
organisms including but not limited to mammals, plants, bacteria or cells derived from 
any of the above. 

General 

The present invention provides for novel methods of sample preparation and 
analysis involving managing or reducing the complexity of a nucleic acid sample, such 
as genomic DNA, in a reproducible manner. The invention further provides for 
analysis of the above sample by hybridization to an array which may be specifically 
designed to interrogate the desired fragments for particular characteristics, such as, for 
example, the presence or absence of a polymorphism. The invention further provides 
for novel methods of using a computer system to model enzymatic reactions in order to 
determine experimental conditions before conducting any actual experiments. As an 
example, the present techniques are useful to identify new polymorphisms and to 
genotype individuals after polymorphisms have been identified. 

Generally, the steps of the present invention involve reducing the complexity of 
a nucleic acid sample using the disclosed techniques alone or in combination. None of 
these techniques require multiplex PCR and most of them can be performed in a single 
tube. With one exception (AP PCR), the methods for complexity reduction involve 
fragmenting the nucleic acid sample, often, but not always, by restriction enzyme digest. 
The resulting fragments, or in the case of AP PCR, PCR products, of interest are then 
isolated. The isolation steps of the present invention vary but may involve size 
selection or direct amplification; often adaptor sequences are employed to facilitate 
isolation. In a preferred embodiment the isolated sequences are then exposed to an 
array which may or may not have been specifically designed and manufactured to 
interrogate the isolated sequences. Design of both the complexity management steps 
and the arrays may be aided by the computer modeling techniques which are also 
described in the present invention. 



Complexity management 

The present invention provides for a number of novel methods of complexity 
management of nucleic acid samples such as genomic DNA. These methods are 
disclosed below. 

A number of methods disclosed herein require the use of restriction enzymes to 
fragment the nucleic acid sample. Methods of using a restriction enzyme or enzymes to 
cut nucleic acids at a large number of sites and selecting a size range of restriction 
fragments for assay have been shown. This scheme is illustrated in Figure 1. 

In one embodiment of the invention, schematically illustrated in Figure 2, 
restriction enzymes are used to cut the nucleic acids in the sample (Fig. 2, Step 1). In 
general, a restriction enzyme recognizes a specific nucleotide sequence of four to eight 
nucleotides (though this number can vary) and cuts a DNA molecule at a specific site. 
For example, the restriction enzyme Eco RI recognizes the sequence GAATTC and will 
cut a DNA molecule between the G and the first A. Many different restriction enzymes 
are known and appropriate restriction enzymes can be chosen for a desired result. For 
example, restriction enzymes can be purchased from suppliers such as New England 
Biolabs. Methods for conducting restriction digests will be known to those of skill in 
the art, but directions for each restriction enzyme are generally supplied with the 
restriction enzymes themselves. For a thorough explanation of the use of restriction 
enzymes, see for example, section 5, specifically pages 5.2-5.32 of Sambrook, et al., 
incorporated by reference above. 
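
The Eco RI example above can be modeled in silico with a short sketch. This is not the program of Exhibit 1; it is a simplified illustration that assumes a linear molecule, considers only the top strand, and uses a hypothetical function name `digest`.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear top-strand sequence at every occurrence of `site`.

    `cut_offset` is the number of bases of the site left of the cut;
    1 places the cut between the G and the first A of GAATTC.
    """
    sequence = sequence.upper()
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


if __name__ == "__main__":
    print(digest("AAGAATTCGGGAATTCTT"))
    # -> ['AAG', 'AATTCGGG', 'AATTCTT']
```

Other enzymes can be modeled under the same assumptions by changing `site` and `cut_offset`.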

After restriction enzyme digestion, the method further requires that the pool of 
digested DNA fragments be separated by size and that DNA fragments of the desired 
size be selected (Figure 2, Step 2) and isolated (Figure 2, Step 3). Methods for 
separating DNA fragments after a restriction digest will be well known to those of skill 
in the art. As a non-limiting example, DNA fragments which have been digested with a 
restriction enzyme may be separated using gel electrophoresis, see for example, 
Maniatis, section 6. In this technique, DNA fragments are placed in a gel matrix. An 
electric field is applied across the gel and the DNA fragments migrate towards the 
positive end. The larger the DNA fragment, the more the fragment's migration is 
inhibited by the gel matrix. This allows for the separation of the DNA fragments by 
size. A size marker is run on the gel simultaneously with the DNA fragments so that 
the fragments of the desired size may be identified and isolated from the gel. Methods 
for purification of the DNA fragments from the gel matrix are also described in 
Sambrook et al. 
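
Size selection can likewise be sketched computationally as the counterpart of excising a size range from a gel. The function names and the use of total retained bases as a rough measure of complexity are assumptions made for this illustration.

```python
def select_by_size(fragments, min_bp, max_bp):
    """Keep fragments whose length lies in the inclusive window [min_bp, max_bp]."""
    return [f for f in fragments if min_bp <= len(f) <= max_bp]


def total_bases(fragments):
    """Total number of bases in a fragment pool (a rough proxy for complexity)."""
    return sum(len(f) for f in fragments)


if __name__ == "__main__":
    pool = ["A" * 50, "C" * 400, "G" * 800, "T" * 1500]
    selected = select_by_size(pool, 250, 1000)
    print([len(f) for f in selected])
    print(total_bases(selected) / total_bases(pool))
```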

Any other non-destructive method of isolating DNA fragments of the desired 
size may be employed. For example, size-based chromatography, HPLC, dHPLC or a 
sucrose density gradient could be used to reduce the DNA pool to those fragments 
within a particular size range and then this smaller pool could be run on an 
electrophoresis gel. 

After isolation, adaptor sequences are ligated to the fragments. (Figure 2, Step 
4) Adaptor sequences are generally oligonucleotides of at least 5 or 10 bases and 
preferably no more than 50 or 60 bases in length; however, adaptor sequences may be 
even longer, up to 100 or 200 bases depending upon the desired result. For example, if 
the desired outcome is to prevent amplification of a particular fragment, longer adaptor 
sequences designed to form stem loops or other tertiary structures may be ligated to the 
fragment. Adaptor sequences may be synthesized using any methods known to those of 
skill in the art. For the purposes of this invention they may, as options, comprise 
templates for PCR primers and/or tag or recognition sequences. The design and use of 
tag sequences is described in US Patent No. 5,800,992 and US Provisional Patent 
Application No. 60/140,359, filed 6/23/99, both of which are incorporated by 
reference for all purposes. Adaptor sequences may be ligated to either blunt end or 
sticky end DNA. Methods of ligation will be known to those of skill in the art and are 
described, for example, in Sambrook et al. Methods include DNase digestion to "nick" 
the DNA, ligation with ddNTP and the use of polymerase I to fill in gaps or any other 
methods described in the art. 

Further complexity reduction is achieved by adding a specific nucleotide on the 
5' end of the PCR primer as illustrated in Figure 3. The specific nucleotide further 
reduces the complexity of the resulting DNA pool because only those fragments which 
have been isolated after restriction enzyme digestion and contain the complement of the 
specific nucleotide(s) incorporated in the PCR primer will be amplified. Figure 3A 
depicts the results of hybridization to an array after enzyme digestion, ligation to an 
adaptor and PCR amplification. Figs. 3B and 3C depict the results of hybridization to 
an array after enzyme digestion, ligation to an adaptor and PCR amplification where the 
PCR primers incorporated specific nucleotides in the 5' end of the primer. In Fig. 3B 
the 5' and 3' primers have different specific nucleotides incorporated. In Fig. 3C the 5' 
and 3' primers have the same nucleotides incorporated. The level of complexity in the 
isolated pool can be varied depending upon the identity and number of nucleotides 
incorporated into the PCR primers. A number of embodiments of the present invention 
involve amplification by PCR. Any of these embodiments may be further modified to 
reduce complexity using the above disclosed technique. 
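
The effect of specific nucleotides on pool complexity can be illustrated with a simplified sketch. It assumes that a fragment is amplified only when the insert bases adjacent to each adaptor match the specific nucleotides chosen for the corresponding primer, and that the four bases are equally likely; both are modeling assumptions, as are the function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_amplified(insert, forward_selective, reverse_selective):
    """True if both ends of the insert match the primers' specific nucleotides.

    `insert` is the top strand between the adaptors, read 5' to 3'.
    The reverse primer is checked against the start of the bottom strand.
    """
    top = insert.upper()
    bottom = reverse_complement(top)
    return top.startswith(forward_selective) and bottom.startswith(reverse_selective)


def select_amplified(inserts, forward_selective="", reverse_selective=""):
    """Return the inserts expected to amplify under this simplified model."""
    return [s for s in inserts if is_amplified(s, forward_selective, reverse_selective)]


def expected_fraction(total_selective_bases):
    """Expected retained fraction, assuming equal base frequencies (an assumption)."""
    return 0.25 ** total_selective_bases


if __name__ == "__main__":
    pool = ["ACGTTG", "CCGTTT", "AGGGCT", "ATTTTA"]
    print(select_amplified(pool))            # no specific nucleotides: whole pool
    print(select_amplified(pool, "A", "A"))  # same nucleotide on both primers
    print(select_amplified(pool, "A", "C"))  # different nucleotides
    print(expected_fraction(2))
```

Under these assumptions each added specific nucleotide reduces the expected pool roughly four-fold; real base composition will shift that figure.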

Various methods of conducting PCR amplification and primer design and 
construction for PCR amplification will be known to those of skill in the art. PCR is a 
method by which a specific polynucleotide sequence can be amplified in vitro. PCR is 
an extremely powerful technique for amplifying specific polynucleotide sequences, 
including genomic DNA, single-stranded cDNA, and mRNA among others. As 
described in U.S. Pat. Nos. 4,683,202, 4,683,195, and 4,800,159 (which are 
incorporated herein by reference), PCR typically comprises treating separate 
complementary strands of a target nucleic acid with two oligonucleotide primers to 
form complementary primer extension products on both strands that act as templates for 
synthesizing copies of the desired nucleic acid sequences. By repeating the separation 
and synthesis steps in an automated system, essentially exponential duplication of the 
target sequences can be achieved. Standard protocols may be found in, for example, 
Sambrook et al., which is hereby incorporated by reference for all purposes. 
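
The "essentially exponential duplication" described above can be made concrete with a small calculation. The per-cycle `efficiency` parameter is a common idealization introduced here for illustration and is not taken from the text; an efficiency of 1.0 corresponds to perfect doubling each cycle.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after `cycles` rounds of PCR.

    Each cycle multiplies the copy number by (1 + efficiency),
    where efficiency is assumed to lie between 0 and 1.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    print(pcr_copies(1, 30))        # perfect doubling for 30 cycles
    print(pcr_copies(1, 30, 0.9))   # a less efficient reaction
```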

In another embodiment, schematically illustrated in Figure 4, the step of 
complexity management of the DNA samples comprises digestion with a Type IIs 
endonuclease thereby creating sticky ends comprised of random nucleic acid 
sequences. (Fig. 4, Step 1) Type-IIs endonucleases are generally commercially 
available and are well known in the art. A description of Type IIs endonucleases can 
be found in US Patent No. 5,710,000 which is hereby incorporated by reference for all 
purposes. Like their Type-II counterparts, Type-IIs endonucleases recognize specific 
sequences of nucleotide base pairs within a double stranded polynucleotide sequence. 
Upon recognizing that sequence, the endonuclease will cleave the polynucleotide 
sequence, generally leaving an overhang of one strand of the sequence, or "sticky end." 

Type-II endonucleases, however, generally require that the specific recognition 
site be palindromic. That is, reading in the 5' to 3' direction, the base pair sequence is 
the same for both strands of the recognition site. For example, the sequence 

G-↓-A-A-T-T-C 
C-T-T-A-A-↑-G 

is the recognition site for the Type-II endonuclease EcoRI, where the arrows indicate 
the cleavage sites in each strand. This sequence is palindromic in that both strands of 
the sequence, when read in the 5' to 3' direction are the same. 
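
Whether a recognition site is palindromic in this sense can be checked by comparing it with its reverse complement. The following is a minimal sketch; the function names are chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic_site(site):
    """True if both strands read the same in the 5' to 3' direction."""
    return site.upper() == reverse_complement(site)


if __name__ == "__main__":
    print(is_palindromic_site("GAATTC"))  # EcoRI site
    print(is_palindromic_site("CTCTTC"))  # a non-palindromic site
```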

The Type-IIs endonucleases, on the other hand, generally do not require 
palindromic recognition sequences. Additionally, these Type-IIs endonucleases also 
generally cleave outside of their recognition sites. For example, the Type-IIs 
endonuclease EarI recognizes and cleaves in the following manner: 

C-T-C-T-T-C-N-↓-N-N-N-N 
G-A-G-A-A-G-n-n-n-n-↑-n 

where the recognition sequence is -C-T-C-T-T-C-, N and n represent complementary, 
ambiguous base pairs and the arrows indicate the cleavage sites in each strand. As the 
example illustrates, the recognition sequence is non-palindromic, and the cleavage 
occurs outside of that recognition site. 
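
The EarI cleavage pattern shown above can be sketched as follows. The sketch assumes the cut positions drawn in the diagram (one base past the site on the top strand, four bases past it on the bottom strand), scans only the orientation in which the site appears on the top strand, and uses hypothetical names throughout.

```python
def type_iis_cuts(sequence, site="CTCTTC", top_offset=1, bottom_offset=4):
    """Locate cuts made outside a Type-IIs recognition site.

    Returns a list of (top_cut, bottom_cut, overhang) tuples, where the cut
    positions are indices into the top strand and `overhang` is the
    single-stranded stretch between the two cuts.
    """
    sequence = sequence.upper()
    results = []
    pos = sequence.find(site)
    while pos != -1:
        site_end = pos + len(site)
        top_cut = site_end + top_offset
        bottom_cut = site_end + bottom_offset
        if bottom_cut <= len(sequence):  # skip sites too close to the end
            results.append((top_cut, bottom_cut, sequence[top_cut:bottom_cut]))
        pos = sequence.find(site, pos + 1)
    return results


if __name__ == "__main__":
    print(type_iis_cuts("AACTCTTCGATCGGTT"))
    # -> [(9, 12, 'ATC')]
```

Because the overhang lies outside the recognition site, its sequence depends on the flanking DNA, which is why the sticky ends are effectively random.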

Specific Type-IIs endonucleases which are useful in the present invention 
include, e.g., EarI, MnlI, PleI, AlwI, BbsI, BsaI, BsmAI, BspMI, Esp3I, HgaI, SapI, 
SfaNI, BbvI, BsmFI, FokI, BseRI, HphI and MboII. The activity of these Type-IIs 
endonucleases is illustrated in FIG. 5, which shows the cleavage and recognition 
patterns of the Type-IIs endonucleases. 

The sticky ends resulting from Type-IIs endonuclease digestion are then ligated 
to adaptor sequences (Fig. 4, Step 2). Those of skill in the art will be familiar with 
methods of ligation. Standard protocols can be found in, for example, Sambrook et al., 
hereby incorporated by reference for all purposes. Only those fragments containing the 
adaptor sequence are isolated. (Figure 6) 

In addition to those methods of isolation discussed above, methods of isolation 
which take advantage of unique tag sequences which may be constructed in the adaptor 
sequences may be employed. These tag sequences may or may not be used as PCR 
primer templates. Fragments containing these tags can then be segregated from other 
non-tag bearing sequences using various methods of hybridization or any of the 
methods described in the above referenced application. 

In another embodiment, depicted in Figure 18, the method of complexity 
reduction comprises digesting the DNA sample with two different restriction enzymes. 
The first restriction enzyme is a frequent base cutter, such as Mse I which has a four 
base recognition site. The second restriction enzyme is a rare base cutter, such as Eco 
RI which has a 6 base recognition site. This results in three possible categories of 
fragments: (most common) those which have been cut on both ends with the frequent 
base cutter, (least common) those which have been cut on both ends with the rare base 
cutter, and those which have been cut on one end with the frequent base cutter and on 
one end with the rare base cutter. Adaptors are ligated to the fragments and PCR 
primers are designed such that only those fragments which fall into the desired category 
or categories are amplified. This technique, employed with a six base cutter and a four 
base cutter can reduce complexity 8-fold when only those fragments from the latter 
category are amplified. Other combinations of restriction enzymes may be employed to 
achieve the desired level of complexity. 
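
The three fragment categories can be illustrated with a simplified in silico double digest. The sketch assumes a linear top strand, the recognition sites TTAA for Mse I and GAATTC for Eco RI with the cut one base into each site, and it ignores the two terminal pieces that have an uncut natural end; the function names and category labels ("FF", "RR", "FR") are hypothetical.

```python
def find_cuts(sequence, site, offset, label):
    """Return (cut_position, enzyme_label) for every occurrence of `site`."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append((pos + offset, label))
        pos = sequence.find(site, pos + 1)
    return cuts


def classify_double_digest(sequence, frequent=("TTAA", 1), rare=("GAATTC", 1)):
    """Group internal fragments by the enzymes that produced their two ends.

    "FF": both ends cut by the frequent cutter, "RR": both ends cut by the
    rare cutter, "FR": one end from each enzyme.
    """
    sequence = sequence.upper()
    cuts = sorted(find_cuts(sequence, *frequent, "F") + find_cuts(sequence, *rare, "R"))
    categories = {"FF": [], "RR": [], "FR": []}
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        key = left + right if left == right else "FR"
        categories[key].append(sequence[start:end])
    return categories


if __name__ == "__main__":
    demo = "CC" + "TTAA" + "GGGG" + "TTAA" + "CCCC" + "GAATTC" + "GG" + "GAATTC" + "CC"
    for name, frags in classify_double_digest(demo).items():
        print(name, frags)
```

Restricting amplification to a single category, for example by matching adaptors and primers to one end type, then corresponds to keeping only one of the three lists.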

In another embodiment, the step of complexity management comprises 
removing repetitive sequences. Figure 10 depicts a schematic representation of this 
embodiment. The nucleic acid sample is first fragmented. (Figure 10, Step 1) 
Various methods of fragmenting DNA will be known to those of skill in the art. These 
methods may be, for example, either chemical or physical in nature. Chemical 
fragmentation may include partial degradation with a DNAse, partial depurination with 
acid, the use of restriction enzymes or other enzymes which cleave DNA at known or 
unknown locations. Physical fragmentation methods may involve subjecting the DNA 
to a high shear rate. High shear rates may be produced, for example, by moving DNA 
through a chamber or channel with pits or spikes, or forcing the DNA sample through a 
restricted size flow passage, e.g., an aperture having a cross sectional dimension in the 
micron or submicron scale. 

In a preferred embodiment adaptor sequences are ligated to the resulting 
fragments. (Figure 10, Step 2) The fragments with or without adaptor sequences are 
then denatured. (Figure 10, Step 3) Methods of denaturation will be well known to 
those of skill in the art. After denaturation, the fragments are then allowed to reanneal. 
(Figure 10, Step 4) Annealing conditions may be altered as appropriate to obtain the 
level of repetitive sequence removal desired. Finally, double stranded sequences are 
removed (Figure 10, Step 4). Methods of removing double stranded sequences will be 
known to those of skill in the art and may include without limitation, methods of 
digesting double stranded DNA such as double strand specific nucleases and 
exonucleases or methods of physical separation including, without limitation, gel based 
electrophoresis or size chromatography. 

In another embodiment, the step of complexity management comprises 
performing an arbitrarily primed polymerase chain reaction (AP PCR) upon the sample. 
AP PCR is described in US Patent No. 5,487,985 which is hereby incorporated by 
reference in its entirety for all purposes. Figure 7 depicts a schematic illustration of 
this embodiment. Performing AP PCR with random primers which have specific 
nucleotides incorporated into the primers produces a reduced representation of genomic 
DNA in a reproducible manner. Figure 8 shows the level of complexity reduction of 
human genomic DNA resulting from AP PCR with various primers. Column 1 lists the 
primer name. Column 2 lists the primer sequence. Column 3 lists the annealing 
temperature. Column 4 lists the polymerase used. Column 5 lists the number 
correlated to a specific gene on the Hum6.8K GeneChip(R) probe array (Affymetrix, 
Inc., Santa Clara, CA). Column 6 lists the percentage of the human genes on the 
Hum6.8K GeneChip(R) probe array found by fragments whose complexity has been 
reduced by this method. Figure 9 shows the reproducibility of AP PCR. Independently 
prepared sample preps were subjected to AP PCR using the same primers. The gel 
bands show that the level of reproducibility between the samples is very high. 

Primers may be designed using standard techniques. For example, a computer 
program is available on the internet at the Operon Technologies, Inc. website at
http://www.operon.com. The Operon Oligo Toolkit allows a user to input a potential primer
sequence into the webform. The site will instantly calculate a variety of attributes for
the oligonucleotide including molecular weight, GC content, Tm, and primer-dimer
sets. The user may also plot the oligonucleotide against a second sequence. PCR
amplification techniques are described above in this application and will be well known
to those of skill in the art.
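Two of the primer attributes named above, GC content and Tm, can be approximated with a few lines of code. The sketch below is an illustration only and makes no claim about how the Operon tool computes its values: the function names are hypothetical, and Tm is estimated with the Wallace rule of thumb, 2 x (A+T) + 4 x (G+C), which is only a rough guide for short oligonucleotides.

```python
def _validated(primer):
    """Upper-case the primer and confirm it contains only A, C, G and T."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    return seq


def gc_content(primer):
    """Return the fraction of G and C bases in the primer (0.0 to 1.0)."""
    seq = _validated(primer)
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Estimate Tm in degrees C using the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = _validated(primer)
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "GATCCGAAGGGGTTCGAATT"  # 20-mer used here purely as sample input
    print(f"GC content: {gc_content(primer):.2f}")
    print(f"Wallace Tm estimate: {wallace_tm(primer)} C")
```

More accurate Tm estimates depend on salt and primer concentration and on nearest-neighbor effects, none of which this rule of thumb takes into account.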

In another embodiment of the invention, the method of reducing the complexity of
a nucleic acid sample comprises hybridizing the sample to a nucleic acid probe
containing a desired sequence which is bound to a solid support, such as a magnetic
bead. For a description of hybridization of nucleic acids to solid supports, see US Pat.
No. 5,800,992 incorporated by reference above. This sequence may comprise, for
example, a sequence containing a SNP, a cDNA fragment, a chromosome fragment, a
subset of genomic DNA or a subset of a library. The sequence may comprise as few as
16 nucleotides and may comprise as many as 2,000, 3,000, 5,000 or more nucleotides
in length. Methods of designing and making oligonucleotide probes will be well
known to those of skill in the art. In one embodiment, the probe may contain a
template sequence for a PCR primer. Solid supports suitable for the attachment of
nucleic acid probe sequences will be well known to those of skill in the art but may
include glass beads, magnetic beads, and/or planar surfaces. Magnetic beads are
commercially available from, for example, Dynal (Oslo, Norway). The nucleic acid 
probes may be synthesized directly on the solid support or attached to the support as a 
full length sequence. Protocols for attaching magnetic beads to probes are included in 
US Patent No. 5,512,439 which is hereby incorporated by reference for all purposes. 
Standard hybridization protocols as discussed above may be employed. 

Figure 11 depicts a schematic representation of one example of the above
embodiment, wherein the complexity management step is utilized to facilitate genome
wide genotyping. Much of the cost of genotyping comes from multiplex PCR. In this
embodiment, the entire sample preparation can be performed in a single tube without
the need for multiplex PCR. Because the desired result is to genotype a DNA sample,
the desired sequence in Figure 11 contains a polymorphism. The oligonucleotide
comprises 32 bases with the SNP in the center. A magnetic bead is attached to the
oligonucleotide probe. (Fig. 11, step 1) The probe is then exposed to, for example,
fractionated genomic DNA. (Fig. 11, step 2) Adaptor sequences are ligated to both
ends of the fragments. (Fig. 11, step 3) The fragments are then amplified (Fig. 11,
step 4) and the PCR product containing the desired polymorphism may then be
analyzed by various methods including, for example, hybridization to an array or single
base extension (SBE). SBE is described in, for example, US Provisional Application
60/140,359 which is hereby incorporated by reference in its entirety for all purposes.

The method may further comprise exposing the hybridized sample to a single 
strand DNA nuclease to remove the single stranded DNA. This embodiment may 
further comprise ligating an adaptor sequence containing a Class IIS restriction
enzyme site to the resulting duplexed DNA and digesting the duplex with the appropriate
Class IIS restriction enzyme to release the attached sequences. The sequences are then
isolated and a second adaptor sequence is ligated to the complex and the sequences are
amplified.

Figures 12 and 13 depict schematic representations of an embodiment
comprising the use of Class IIS endonucleases. Both figures depict methods which may
be employed for single tube genotyping without the need for multiplex PCR. In
Figures 12 and 13, the desired sequence is a SNP. The oligonucleotide probe in Figure
12 is 32 bases long and in Figure 13 is 17 bases long. In both figures the SNP is in the
center of the oligonucleotide. The oligonucleotide probe is bound to a magnetic bead.
(Figs. 12 and 13, step 1) The probe is then hybridized to fragmented genomic DNA.
(Figs. 12 and 13, step 2) Single stranded DNA is digested with a single strand DNA
nuclease leaving a DNA duplex attached to the magnetic bead. (Figs. 12 and 13, step
3) An adaptor sequence is then ligated to the duplex. The adaptor sequence contains a
Class IIS restriction site. The probe length and Class IIS endonuclease are chosen such
that the site where the duplex is cut is between the SNP and the magnetic bead. In
Figure 12 the Class IIS endonuclease cuts directly adjacent to the SNP site, such that
the SNP is part of the sticky end left by the endonuclease digestion. (Fig. 12, step 5)
In Figure 13 the endonuclease cuts closer to the magnetic bead, leaving a number of
bases between the sticky end and the SNP site. (Fig. 13, step 5) In either case, the
magnetic bead is released and the sequences are isolated. Adaptor sequences are then
ligated to the sticky ends. (Figs. 12 and 13, step 6) In both Figures 12 and 13 the
adaptor sequences contain templates for PCR probes. The fragments containing the
SNP are then amplified (Figs. 12 and 13, step 7) and the PCR products may be
analyzed in a number of different methods including hybridization to an array designed
to detect SNPs or SBE.

In this embodiment, the adaptor sequence may further comprise a SNP
identification sequence or tag. In this case, the array to which the PCR products are
hybridized may be a generic tag array as described in the above referenced US Patent
No. 5,800,992 and US Provisional Patent Application 60/140,359 or a chimeric probe
array (Figure 14). A chimeric probe array contains probes which interrogate both for
particular sequences characteristic of a genotype as well as for artificial sequences
which have been ligated to specific fragments in the sample pool. This allows for
higher specificity of hybridization and better differentiation between probes. This
embodiment is depicted in Figure 15.

In another embodiment, depicted in Figure 16, the method of complexity
reduction comprises hybridizing the DNA sample to a mismatch binding protein (Fig.
16, step 2). Mismatch binding proteins are described in Wagner R. and Radman, M.
(1995) "Methods: A Companion to Methods in Enzymology" 7, 199-203 which is
hereby incorporated by reference in its entirety for all purposes. Mismatch binding
proteins preferentially bind to DNA duplexes which contain sequence mismatches.
This allows for a relatively simple and rapid method to locate and identify SNPs. In
this embodiment no prior knowledge of the SNP is required. Mismatch binding proteins
are commercially available through GeneCheck (Ft. Collins, Co.). In a further
embodiment, depicted in Figure 17, magnetic beads are attached to the mismatch
binding proteins. Mismatch binding proteins attached to magnetic beads are
commercially available through GeneCheck (Ft. Collins, Co.). After hybridization the
sample is digested with a 3' to 5' exonuclease (Fig. 16, step 3). Remaining single
stranded DNA is then removed with a nuclease (Fig. 16, step 4).

If it is desired to cut the duplex at the mismatch, then the enzyme resolvase may
be used. See US Patent Nos. 5,958,692, 5,871,911 and 5,876,941 (each of which is
incorporated by reference in its entirety for all purposes) for a description of various
methods of cleaving nucleic acids. The resolvases (e.g. X-solvases of yeast and
bacteriophage T4, Jensch et al., EMBO J. 8, 4325 (1989)) are nucleolytic enzymes
capable of catalyzing the resolution of branched DNA intermediates (e.g., DNA
cruciforms) which can involve hundreds of nucleotides. In general, these enzymes are
active close to the site of DNA distortion (Bhattacharyya et al., J. Mol. Biol., 221,
1191, (1991)). T4 Endonuclease VII, the product of gene 49 of bacteriophage T4 (Kleff
et al., The EMBO J. 7, 1527, (1988)) is a resolvase (West, Annu. Rev. Biochem. 61,
603, (1992)) which was first shown to resolve Holliday-structures (Mizuuchi et al., Cell
29, 357, (1982)). T4 Endonuclease VII has been shown to recognize DNA cruciforms
(Bhattacharyya et al., supra; Mizuuchi et al., supra) and DNA loops (Kleff et al., supra),
and it may be involved in patch repair. Bacteriophage T7 Endonuclease I has also been
shown to recognize and cleave DNA cruciforms (West, Ann. Rev. Biochem. 61, 603,
(1992)). Eukaryotic resolvases, particularly from the yeast Saccharomyces cerevisiae,
have been shown to recognize and cleave cruciform DNA (West, supra; Jensch, et al.,
EMBO J. 8, 4325 (1989)). Other nucleases are known which recognize and cleave
DNA mismatches. For example, S1 nuclease is capable of recognizing and cleaving
DNA mismatches formed when a test DNA and a control DNA are annealed to form a
heteroduplex (Shenk et al., Proc. Natl. Acad. Sci. 72, 989, (1975)). The Mut Y repair
protein of E. coli is also capable of detecting and cleaving DNA mismatches.

Computer Implemented Analysis

In another embodiment a computer system is used to model the reactions
discussed above to aid the user in selecting the correct experimental conditions. In this
embodiment, the sequence of the DNA sample must be known. A computer program
queries an electronic database containing the sequence of the DNA sample looking for
sites which will be recognized by the enzyme being used. The method of modeling
experiments can be employed for a wide variety of experiments.

In one embodiment, the user can run multiple experiments altering various
conditions. For example, if a user desires to isolate a particular sequence of interest in
a fragment which has been digested with a restriction enzyme, the user can have the
computer model the possible outcomes using a wide variety of restriction enzymes.
The particular sequence which is selected may be chosen by specific criteria, i.e.
because the region is believed to be associated with specific genes, polymorphisms, or
phenotypes for example, or may be chosen at random. The user can then select the
restriction enzyme which, for example, isolates the desired sequence in a fragment of
unique size. Additionally or alternatively, if the user desires to reduce complexity
using the type IIS nuclease/ligation technique described above, the user can experiment
with the length and sequence of the adaptors to determine the optimal sequence for the
adaptors' "sticky" ends. This enables the user to be confident that they will obtain a
fragment containing a particular sequence of interest or to fine tune the level of
complexity in the DNA pool. In another embodiment, a user could model the kinetics
of the denaturing, reannealing technique for removal of repeated sequences discussed
above to determine the conditions which allow for the desired result. For example, a
user may desire the removal of only a certain percentage of repeated sequences.

For example, virtual restriction digests may be performed by querying an
electronic database which contains the sequence of DNA of interest. Because the
database contains the nucleic acid sequence and restriction enzymes cut at known
locations based on the DNA sequence, one can easily predict the sequence and size of
fragments which will result from a restriction digest of the DNA. Ideally, restriction
enzymes which produce no two fragments of the same or very similar size are desired.
Combinations of restriction enzymes may be employed. Those of skill in the art will be
familiar with electronic databases of DNA sequences. GenBank, for example, contains
approximately 2,570,000,000 nucleic acid bases in 3,525,000 sequence records as of
April 1999. A computer program searches the electronic database for a sequence which
suits the requirements of the particular restriction enzyme. For example, the restriction
enzyme Eco RI recognizes the sequence GAATTC and will cut a DNA molecule
between the G and the first A. The computer program will query the chosen sequence
for any occurrences of the sequence GAATTC and mark the site where the restriction
enzyme will cut. The program will then provide the user with a display of the resulting
fragments.
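The Eco RI query described above can be sketched in a few lines. This is a minimal illustration written for this passage and is not the program of Exhibit 1: the function name and parameters are hypothetical, only the top strand of a linear molecule is modeled, and the input is assumed to be a plain string of bases rather than a database record.

```python
def virtual_digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the number of bases from the start of the site to the cut
    on the top strand (1 for Eco RI, which cuts between G and the first A).
    Returns (cut_positions, fragments).
    """
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)      # mark where the enzyme cuts
        start = seq.find(site, start + 1)    # look for the next site
    bounds = [0] + cuts + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    return cuts, fragments


if __name__ == "__main__":
    cuts, fragments = virtual_digest("AAGAATTCTTGAATTCGG")
    for fragment in fragments:
        print(len(fragment), fragment)
```

Passing a different `site` and `cut_offset` models another enzyme, and running the function once per candidate enzyme gives the fragment sizes needed to compare enzymes against each other.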

Exhibit 1 is an example of a program to conduct this type of virtual enzyme
digestion. Exhibit 2 is an example of a program to virtually model the ligation of two
sequences to each other.
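One element of modeling a ligation is deciding whether two sticky ends can anneal. The sketch below is a hypothetical illustration and is not the program of Exhibit 2: it assumes both overhangs are of the same type (for example, both 5' overhangs), that each is written 5' to 3', and that N in an adaptor stands for any base.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def bases_pair(x, y):
    """Return True if bases x and y can pair; N is treated as any base."""
    if x == "N" or y == "N":
        return True
    return COMPLEMENT.get(x) == y


def overhangs_compatible(overhang_a, overhang_b):
    """Check whether two single stranded overhangs can anneal.

    Both overhangs are written 5'->3'. Because the strands pair in an
    antiparallel orientation, base i of one overhang is compared with
    base i (counted from the 3' end) of the other.
    """
    a, b = overhang_a.upper(), overhang_b.upper()
    if len(a) != len(b):
        return False
    return all(bases_pair(x, y) for x, y in zip(a, reversed(b)))


if __name__ == "__main__":
    print(overhangs_compatible("AATT", "AATT"))  # Eco RI end with Eco RI end
    print(overhangs_compatible("AATT", "GATC"))  # Eco RI end with Sau3A I end
    print(overhangs_compatible("ACNN", "TTGT"))  # adaptor with two degenerate bases
```

Varying the number and position of N bases in the adaptor overhang changes how many distinct fragment ends it will accept, which is one way to explore the level of complexity reduction before running the experiment.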

In another embodiment, the method of modeling experiments in a computer 
system can be used to design probe arrays. A database may be interrogated for any
desired sequence, for example, a polymorphism. Computer modeled reactions are then
performed to help determine the method for isolating a fragment of DNA containing
the sequence of interest. These methods may comprise any of the methods described
above, alone or in combination. Arrays are then constructed which are designed to
interrogate the resulting fragments. It is important to note that for the purpose of
designing arrays, the virtual reactions need not be performed flawlessly, since the 
arrays may contain hundreds of thousands of sequences. 

One embodiment of the invention relies on the use of virtual reactions to 
predetermine the sequence of chosen DNA fragments which have been subjected to various
procedures. The sequence information for the chosen fragments is then used to design
the probes which are to be attached to DNA arrays. Arrays may be designed and
manufactured in any number of ways. For example, DNA arrays may be synthesized
directly onto a solid support using methods described in, for example, US Patent Nos.
5,837,832, 5,744,305 and 5,800,992 and WO 95/11995 herein incorporated by
reference for all purposes. See also, Fodor et al., Science, 251:767-777 (1991), Pirrung
et al., U.S. Pat. No. 5,143,854 (see also PCT Application No. WO 90/15070) and Fodor
et al., PCT Publication No. WO 92/10092 and U.S. Pat. No. 5,424,186, each of which
is hereby incorporated in its entirety by reference for all purposes. Techniques for the
synthesis of these arrays using mechanical synthesis methods are described in, e.g.,
U.S. Pat. No. 5,384,261, incorporated herein by reference in its entirety for all purposes.
Briefly, 5,837,832 describes a tiling method for array fabrication whereby probes are 
synthesized on a solid support. These arrays comprise a set of oligonucleotide probes 
such that, for each base in a specific reference sequence, the set includes a probe (called 
the "wild-type" or "WT" probe) that is exactly complementary to a section of the 
sequence of the chosen fragment including the base of interest and four additional 
probes (called "substitution probes"), which are identical to the WT probe except that 
the base of interest has been replaced by one of a predetermined set (typically 4) of 
nucleotides. Probes may be synthesized to query each base in the sequence of the 
chosen fragment. Target nucleic acid sequences which hybridize to a probe on the 
array which contain a substitution probe indicate the presence of a single nucleotide 
polymorphism. Other applications describing methods of designing tiling arrays 
include: US Patent Nos. 5,858,659, and 5,861,242 each of which is incorporated by 
reference in its entirety for all purposes. In a similar manner, arrays could be 
constructed to test for a variety of sequence variations including deletions, repeats or 
base changes greater than one nucleotide. US Patent Nos. 5,593,839 and 5,856,101 
(each of which is incorporated by reference for all purposes) describe methods of using 
computers to design arrays and lithographic masks. 
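The tiling scheme described above can be illustrated for a single base of interest. The sketch below is an assumption-laden illustration, not the design method of the cited patents: the function names and the `flank` parameter are hypothetical, probes are written 5' to 3' as the reverse complement of the reference window, and only the three substitutions that differ from the reference base are generated (a fourth substitution probe would simply duplicate the WT probe).

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tiling_probe_set(reference, position, flank=2):
    """Build the WT probe and substitution probes for one base of interest.

    reference -- reference sequence (A, C, G, T), written 5'->3'
    position  -- 0-based index of the base of interest
    flank     -- number of reference bases on each side of that base
    Returns (wt_probe, {substituted_base: probe}).
    """
    ref = reference.upper()
    if position - flank < 0 or position + flank >= len(ref):
        raise ValueError("probe window falls outside the reference sequence")
    left = ref[position - flank:position]
    right = ref[position + 1:position + flank + 1]
    wt_probe = reverse_complement(left + ref[position] + right)
    substitutions = {
        base: reverse_complement(left + base + right)
        for base in "ACGT"
        if base != ref[position]
    }
    return wt_probe, substitutions


if __name__ == "__main__":
    wt, subs = tiling_probe_set("ACGTACGTA", position=4, flank=2)
    print("WT:", wt)
    for base, probe in sorted(subs.items()):
        print(f"{base}:", probe)
```

Calling the function for every position in the chosen fragment yields a full tiling set; a real design would use much longer flanks than the value of 2 used here for readability.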

The label used to detect the target sequences will be determined, in part, by the 
detection methods being applied. Thus, the labeling method and label used are selected 
in combination with the actual detecting systems being used. Once a particular label has 
been selected, appropriate labeling protocols will be applied, as described below for 
specific embodiments. Standard labeling protocols for nucleic acids are described, e.g., 
in Maniatis; Kambara, H. et al. (1988) BioTechnology 6:816-821; Smith, L. et al.
(1985) Nuc. Acids Res. 13:2399-2412; for polypeptides, see, e.g., Allen G. (1989)
Sequencing of Proteins and Peptides, Elsevier, N.Y., especially chapter 5, and
Greenstein and Winitz (1961) Chemistry of the Amino Acids, Wiley and Sons, N.Y.
Carbohydrate labeling is described, e.g., in Chaplin and Kennedy (1986) Carbohydrate
Analysis: A Practical Approach, IRL Press, Oxford. Other techniques such as TdT end
labeling may likewise be employed. Techniques for labeling protocols for use with
SBE are described in, e.g. US Provisional Patent Application 60/140,359 which is 
incorporated by reference above. 

Generally, when using a DNA array a quickly and easily detectable signal is
preferred. Fluorescent tagging of the target sequence is often preferred, but other
suitable labels include heavy metal labels, magnetic probes, chromogenic labels (e.g.,
phosphorescent labels, dyes, and fluorophores), spectroscopic labels, enzyme linked
labels, radioactive labels, and labeled binding proteins. Additional labels are described
in U.S. Pat. Nos. 5,800,992 and 4,366,241, and published PCT Application WO
99/13319 which are incorporated herein by reference.

The hybridization conditions between probe and target should be selected such 
that the specific recognition interaction, i.e., hybridization, of the two molecules is both 
sufficiently specific and sufficiently stable. See, e.g., Hames and Higgins (1985) 
Nucleic Acid Hybridisation: A Practical Approach, IRL Press, Oxford. These
conditions will be dependent both on the specific sequence and often on the guanine
and cytosine (GC) content of the complementary hybrid strands. The conditions may
often be selected to be universally equally stable independent of the specific sequences
involved. This typically will make use of a reagent such as an alkylammonium buffer.
See, Wood et al. (1985) "Base Composition-independent Hybridization in
Tetramethylammonium Chloride: A Method for Oligonucleotide Screening of Highly
Complex Gene Libraries," Proc. Natl. Acad. Sci. USA, 82:1585-1588; and Krupov et
al. (1989) "An Oligonucleotide Hybridization Approach to DNA Sequencing," FEBS
Letters, 256:118-122; each of which is hereby incorporated herein by reference. An
alkylammonium buffer tends to minimize differences in hybridization rate and stability
due to GC content. By virtue of the fact that sequences then hybridize with
approximately equal affinity and stability, there is relatively little bias in strength or
kinetics of binding for particular sequences. Temperature and salt conditions along
with other buffer parameters should be selected such that the kinetics of renaturation
are essentially independent of the specific target subsequence or oligonucleotide
probe involved. In order to ensure this, the hybridization reactions will usually be
performed in a single incubation of all the substrate matrices together exposed to the
identical target probe solution under the same conditions. The hybridization
conditions will usually be selected to be sufficiently specific such that the fidelity of
base matching will be properly discriminated. Of course, control hybridizations should
be included to determine the stringency and kinetics of hybridization. See for example,
US Patent No. 5,871,928 which is hereby incorporated in its entirety for all purposes. 

Another factor that can be adjusted to increase the ability of targets to hybridize
to probes is the use of nucleic acid analogs or PNAs in the probes. They can be built
into the probes to create a more uniform set of hybridization conditions across the 
entire array. See US Patent Application No. 08/630,427 incorporated by reference 
above. 

The detection methods used to determine where hybridization has taken place
will typically depend upon the label selected. Thus, for a fluorescent label a fluorescent
detection apparatus will typically be used. Pirrung et al. (1992) U.S. Pat. No. 5,143,854
and Ser. No. 07/624,120, now abandoned, (both of which are hereby incorporated by
reference for all purposes) describe apparatus and mechanisms for scanning a substrate
matrix using fluorescence detection, but a similar apparatus is adaptable for other
optically detectable labels. See also, US Patent Nos. 5,578,832, 5,834,758, and 
5,837,832 each of which is incorporated by reference in its entirety for all purposes. 

A variety of methods can be used to enhance detection of labeled targets bound
to a probe attached to a solid support. In one embodiment, the protein MutS (from E.
coli) or equivalent proteins such as yeast MSH1, MSH2, and MSH3; mouse Rep-3, and
Streptococcus Hex-A, is used in conjunction with target hybridization to detect
probe-target complexes that contain mismatched base pairs. The protein, labeled directly or
indirectly, can be added during or after hybridization of target nucleic acid, and
differentially binds to homo- and heteroduplex nucleic acid. A wide variety of dyes
and other labels can be used for similar purposes. For instance, the dye YOYO-1 is
known to bind preferentially to nucleic acids containing sequences comprising runs of 3
or more G residues. Signal amplification methods as described in US Patent
Application No. 09/276,774 may likewise be used.

Various methods of hybridization detection will be known to those of skill in 
the art. See for example, US Patent Nos. 5,578,832, 5,631,734, 5,744,305 and 
5,800,992 each of which is hereby incorporated in its entirety for all purposes.

Examples 

Example 1 - Restriction Enzyme Digest/Sizing

The complexity of total genomic DNA from human and yeast was reproducibly
reduced using a restriction enzyme digestion. For each species 0.5 ug genomic DNA
was digested with 20 units of EcoRI in a total volume of 40 ul at 37 °C overnight
(Figure 2, Step 1). The enzyme was inactivated by incubation at 65 °C for 10 minutes.

The DNA solution was mixed with 10 ul 5x loading buffer and separated by gel
electrophoresis on a 2% agarose gel. (Figure 2, Step 2) The gel was visualized by
ethidium bromide staining. Fragments of 250 - 350 bp were excised from the gel and
purified using a QIAquick gel extraction kit (Qiagen). (Figure 2, Step 3) Alternatively,
fragments of the required size could have been isolated using HPLC.

Adaptor sequences containing PCR primer template sequences were then
ligated to the purified fragments using 100 U T4 ligase in 1x T4 DNA ligase buffer
(New England Biolabs) at 16 °C overnight. The adaptor sequences were 5'-
d(pAATTCGAACCCCTTCGGATC)-3' and 5'-d(GATCCGAAGGGGTTCGAATT)-3'.
(Figure 2, Step 4) The ligase was then heat inactivated at 65 °C for 15 minutes.

The fragments were then subjected to PCR with one primer that corresponded to
the PCR primer template sequence 5'-d(GATCCGAAGGGGTTCGAATT)-3' (Figure
2, Step 5). The PCR mixture contained approx. 1 ng ligated DNA fragments, 5 units
AmpliTaq Gold polymerase (Perkin Elmer), 5 uM primer, 200 uM dNTPs, 15 mM
Tris-HCl (pH 8.2), 50 mM KCl, 2.5 mM MgCl2 in a final volume of 50 ul. PCR was
performed in a Perkin-Elmer 9600 thermocycler using an initial 10 minute denaturation
at 95 °C, 35 cycles of a 1 minute denaturation at 94 °C, annealing for 1 minute at 57 °C
and extension at 72 °C for 2 minutes. This was followed by a final 5 minute extension
cycle at 72 °C.

The PCR products were then purified with QIAquick PCR Purification kit 
(Qiagen) according to the manufacturer's instructions and fragmented with DNase I. 

The remaining fragments were then labeled with biotin-N6-ddATP as follows:
In each tube, incubate 10 ug DNA with 0.3 unit DNase I (Promega) at 37 °C for 30
minutes in a 45 ul mixture also containing 10 mM Tris-Acetate (pH 7.5), 10 mM
magnesium acetate and 50 mM potassium acetate. Stop the reaction by heating the
sample to 95 °C for 15 minutes. Label the sample by adding 60 unit terminal
transferase and 4 pmol biotin-N6-ddATP (Dupont NEN) followed by incubation at 37 °C
for 90 minutes and a final heat inactivation at 95 °C for 15 minutes.

The labeled DNA was then hybridized to an array in a hybridization mixture
containing 80 ug labeled DNA, 160 ug human COT-1 DNA (GIBCO), 3.5 M
tetramethylammonium chloride, 10 mM MES (pH 6.5), 0.01% Triton-100, 20 ug herring
sperm DNA, 100 ug bovine serum albumin and 200 pM control oligomer at 44 °C for
40 hours on a rotisserie at 40 rpm. The arrays were then washed with 0.1 M NaCl in
10 mM MES at 44 °C for 30 minutes on a rotisserie at 40 rpm. The hybridized arrays
were then stained with a staining solution [10 mM MES (pH 6.5), 1 M NaCl, 10 ug/ml
streptavidin R-phycoerythrin, 0.5 mg/ml acetylated BSA, 0.01% Triton-100] at 40 °C
for 15 minutes. The arrays were then washed with 6x SSPET [0.9 M NaCl, 60 mM
NaH2PO4 (pH 7.4), 6 mM EDTA, 0.005 % Triton-100] on a GeneChip® Fluidics
Station (Affymetrix, Inc., Santa Clara, CA) 10 times at 22 °C. The arrays were then
anti-streptavidin antibody stained at 40 °C for 30 minutes with antibody solution
[10 mM MES (pH 6.5), 1 M NaCl, 10 ug/ml streptavidin R-phycoerythrin, 0.5 mg/ml
acetylated BSA, 0.01% Triton-100]. The arrays were then restained with staining solution
for 15 minutes followed by 6X SSPET washing as above. The arrays were then scanned
with a confocal scanner at 560 nm. The hybridization patterns were then screened for
SNP detection with a computer program as described in D.G. Wang et al., Science 280,
1077-1082, 1998. The results of the hybridization can be seen in Figures 8A and 8B.

Example 2 - Digestion with a Type IIS Endonuclease and Selective Ligation

Complexity was reproducibly reduced after digestion with a type IIS
endonuclease and selective ligation to an adaptor sequence. 2 ug of genomic DNA was
digested with Bbv I at 37 °C overnight. (Figure 3, Step 1) The enzyme was heat
inactivated at 65 °C for 15 minutes.

Adaptors containing PCR primer template sequences were ligated in a 50 ul
mixture of 400 ng digested genomic DNA, 10 pmol adaptor and 40 unit T4 ligase in a
1X T4 ligase buffer. (Figure 3, Step 2) The adaptor sequences were as follows: 5'-
d(pATNNGATCCGAAGGGTTCGAATTC)-3' and
5'-d(GAATTCGAACCCCTTCGGATC)-3'. The ligation was conducted at 16 °C
overnight. The ligase was inactivated by incubation at 65 °C for 15 minutes.

The fragments were then subjected to PCR with one primer that corresponded to
the PCR primer template sequence: 5'-d(GAATTCGAACCCCTTCGGATC)-3' in a 50
ul reaction containing 20 ng ligated DNA, 1 unit AmpliTaq Gold polymerase (Perkin
Elmer), 3 uM primer, 200 uM dNTPs, 15 mM Tris-HCl (pH 8.0), 50 mM KCl, 2.5 mM
MgCl2. PCR was performed in a Perkin-Elmer 9600 thermocycler using an initial 10
minute denaturation at 95 °C, 35 cycles of a 0.5 minute denaturation at 94 °C, annealing
for 0.5 minute at 57 °C and extension at 72 °C for 2 minutes. This was followed by a final
5 minute extension cycle at 72 °C.

Example 3 - Double Digestion and Selective PCR

Human genomic DNA was digested in a 40 ul reaction at 37 °C for 1 hour. The
reaction mixture contained 0.5 ug human genomic DNA, 0.5 mM DTT, 5 units EcoRI
(New England Biolabs), 5 units Sau3AI (New England Biolabs), 0.5 ng/ul BSA, 10
mM Tris-Acetate (pH 7.5), 10 mM magnesium acetate and 50 mM potassium acetate.
The enzymes were inactivated at 65 °C for 15 minutes.

The restriction fragments were then ligated to adaptor sequences. The ligation
mixture contained: 5 pmol Eco R I adaptor [5'-d(pAATTCGAACCCCTTCGGATC)-3'
and 5'-d(GATCCGAAGGGGTTCG)-3'], 50 pmol Sau3A I adaptor [5'-
d(pGATCGCCCTATAGTGAGTCGTATTACAGTGGACCATCGAGGGTCA)-3'], 5
mM DTT, 0.5 ng/ul BSA, 100 unit T4 DNA ligase, 1 mM ATP, 10 mM Tris-Acetate
(pH 7.5), 10 mM magnesium acetate and 50 mM potassium acetate. The ligation
mixture was incubated with the restriction fragments at 37 °C for 3 hours. The ligase
was inactivated at 65 °C for 20 minutes.

The ligated DNA target was then amplified by PCR. The PCR mixture
contained 12.5 ng ligated DNA, 1 unit AmpliTaq Gold polymerase (Perkin Elmer),
0.272 uM EcoRI selective primer (5'-AAGGGGTTGGGAATTGGG-3'; GC as the
selective bases), 0.272 uM Sau3AI selective primer (5'-TGAGTATAGGGGGATGTG-
3'; TG as the selective bases), 200 uM dNTPs, 15 mM Tris-HCl (pH 8.0), 50 mM KCl,
2.5 mM MgCl2 in a final volume of 50 ul. PCR was performed in a Perkin-Elmer 9600
thermocycler using an initial 10 minute denaturation at 95 °C, 35 cycles of a 1 minute
denaturation at 94 °C, annealing for 1 minute at 56 °C and extension at 72 °C for 2
minutes. This was followed by a final 5 minute extension at 72 °C.



Example 4 - Arbitrarily Primed PCR

PCR primers were designed with the Operon Oligo Toolkit described in the
specification above.

Human genomic DNA was amplified in a 100 ul reaction containing 100 ng 
genomic DNA, 1.25 units AmpliTaq Gold polymerase (Perkin Elmer), 10 uM arbitrary
primer, 200 uM dNTPs, 10 mM Tris-HCl (pH 8.3), 50 mM KCl and 2.5 mM MgCl2.

PCR was performed in a Perkin-Elmer 9600 thermocycler using an initial 10
minute denaturation at 95 °C, 35 cycles of a 1 minute denaturation at 94 °C, annealing
for 1 minute at 56 °C and extension at 72 °C for 2 minutes. This was followed by a final 7
minute extension at 72 °C.

The PCR product was then purified, fragmented, labeled and hybridized as
described in the examples above. 

Example 5 - SNP Discovery - Generally

As an example, the present invention may be directed to a method for 
simplifying the detection of or comparing the presence or absence of SNPs among
individuals, populations, species or between different species. This invention allows
for a quick and cost-effective method of comparing polymorphism data between
multiple individuals. First, a reduced representation of a nucleic acid sample is
produced in a repeatable and highly reproducible manner from multiple individuals,
using any of the above described techniques alone or in combination. Then, the data
generated by hybridizing the DNA samples collected from multiple individuals to
identical arrays in order to detect the presence or absence of a number of sequence
variants is compared. Arrays are designed to detect specific SNPs or simply to detect
the presence of a region known to frequently contain SNPs. In the latter case, other
techniques such as sequencing could be employed to identify the SNP.

SNP discovery - Method 1

10 Typically, the detection of SNPs has been made using at least one procedure in 

which the nucleic acid sequence that may contain the SNP is amplified using PCR 
primers. This use can create an expense if many SNPs are to be evaluated or tested and 
it adds significantly more time to the experiment for primer design and selection and 
testing. The following example eliminates the need for the specific PCR amplification 

15 step or steps. First, using the methods provided in example 1 above, a restriction 

enzyme or enzymes is used to cut genomic DNA at a large number of sites and a size 
range of restriction fragments is selected for assay. An electronic database, such as 
GenBank is queried to determine which sequences would be cut with the specific 
restriction enzyme(s) that were selected above. The sequences of the resulting 

20 firagments are then used to design DNA arrays which will screen the regions for the 
SNPs or other variants. The selected fragments are then subjected to further 
fragmentation and hybridized to the array for analysis. 
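
By way of illustration only, the database query and size selection of Method 1 can be sketched in Python. This is a minimal sketch, not the procedure actually used: the function names `digest` and `select_by_size`, the toy sequence and the 8 to 20 base window are assumptions made for the example, and only the top strand of a single EcoRI-type site (G^AATTC) is modeled.

```python
import re

def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site` and return the fragments.

    cut_offset is the number of bases into the site at which the top strand
    is cut (1 for an EcoRI-type site, G^AATTC).
    """
    seq = sequence.upper()
    # A lookahead is used so that overlapping sites would also be found.
    cuts = [m.start() + cut_offset
            for m in re.finditer("(?=" + re.escape(site) + ")", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

def select_by_size(fragments, min_len, max_len):
    """Keep only the fragments whose length falls in the chosen size range."""
    return [f for f in fragments if min_len <= len(f) <= max_len]

# Hypothetical toy sequence containing three GAATTC sites.
toy = "AAGAATTCCCCCGGGGGAATTCTTGAATTCA"
fragments = digest(toy, "GAATTC", 1)
selected = select_by_size(fragments, 8, 20)
print([len(f) for f in fragments], selected)
```

The selected fragment sequences are what would be passed on to array design; a real run would read sequences from a database export rather than a literal string.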

SNP discovery - Method 2

Alternatively, the method provided in Example 2 above may be employed: type
IIS restriction enzymes cut genomic DNA from each individual and adaptor sequences
are designed to ligate to specific fragments as desired. Adaptor sequences may include
both random and specific nucleotide ends as required to produce the desired result. If
desired, amplification primers may be designed to hybridize to the adaptor sequences,
allowing for amplification of only the fragments of interest. An electronic database and
computer modeling system may be used to aid in the selection of appropriate
experimental conditions and to design the appropriate arrays. The fragments are then
hybridized to the array for analysis.
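
As a hedged illustration of why adaptors with specific ends can select a subset of type IIS fragments, the following Python sketch lists the overhang left at each recognition site. It is an assumption-laden toy: the helper names are hypothetical, only sites on the forward strand are scanned, and FokI (GGATG, cutting 9 bases downstream on the top strand and 13 on the bottom) is used purely as a familiar example of a type IIS enzyme.

```python
def type_iis_overhangs(sequence, site, top_offset, bottom_offset):
    """List (site_start, overhang) for each forward-strand occurrence of `site`.

    top_offset / bottom_offset: number of bases between the end of the
    recognition site and the cut on the top / bottom strand.
    """
    seq = sequence.upper()
    results = []
    start = seq.find(site)
    while start != -1:
        cut_top = start + len(site) + top_offset
        cut_bottom = start + len(site) + bottom_offset
        if cut_bottom <= len(seq):  # skip sites too close to the end
            results.append((start, seq[cut_top:cut_bottom]))
        start = seq.find(site, start + 1)
    return results

def sites_matching_adaptor(overhangs, adaptor_end):
    """Return the site positions whose overhang equals the adaptor's specific end."""
    return [pos for pos, overhang in overhangs if overhang == adaptor_end]

# Hypothetical toy sequence with two GGATG sites.
toy = ("TT" + "GGATG" + "ACGTACGTA" + "ACGT" + "TTTT"
       + "GGATG" + "CCCCCCCCC" + "AAAA" + "TT")
overhangs = type_iis_overhangs(toy, "GGATG", 9, 13)
print(overhangs, sites_matching_adaptor(overhangs, "ACGT"))
```

Because the overhang lies outside the recognition site, its sequence varies from fragment to fragment, so an adaptor ending in one specific overhang ligates to only a fraction of the fragments.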

SNP discovery - Method 3

As another alternative, MutS protein was used to isolate DNA containing
SNPs for analysis on an array. 3 ug of DNA was fragmented with Eco RI
(alternatively, DNase I could have been used). At this point an equal amount of
control DNA was added (this step is optional).

0.5 ug of the fragments were denatured at 95 °C for 10 minutes and gradually
cooled to 65 °C over a 60 minute period. The fragments were then incubated at 65 °C
for 30 minutes and the temperature was ramped down to 25 °C over a 60 minute period.
1.5 ug MutS protein (Epicentre) was then added and allowed to incubate at room
temperature for 15 minutes to allow for binding. (Figure 7, Step 1)

The bound fragments were then digested with 20 units T7 polymerase (New
England Biolabs) at 30 °C for 30 minutes. (Figure 7, Step 2) The T7 polymerase was
inactivated by incubation at 65 °C for 10 minutes.

Single stranded DNA was trimmed with 100 units of nuclease S1 (Boehringer-Mannheim)
at 16 °C for 15 minutes. (Figure 7, Step 3) The enzymes were inactivated by
adding 50 nmol EDTA and incubation at 65 °C for 15 minutes.

Adaptor sequences containing PCR primer templates were then ligated to the
DNA sequences in a 10 ul ligation mixture [1 ul DNA solution, 4 ul dH2O, 1 ul 10X T4
DNA ligase buffer, 3 ul 10 mM adaptor (5'-d(GATCCGAAGGGGTTCGAATT)-3'
and 5'-d(pGAATTCGAACCCCTTCGGATC)-3') and 1 ul 400 U/ul T4 DNA ligase],
incubated at 16 °C overnight and then inactivated at 65 °C for 15 minutes. (Figure
7, Step 4)
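
As a quick consistency check on the two adaptor oligonucleotides listed above, the following Python sketch confirms that they are complementary. It is offered only as an illustration; the helper name `reverse_complement` is an assumption of the example, and the 5' phosphate is ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

top = "GATCCGAAGGGGTTCGAATT"        # 20-mer, also used as the PCR primer
bottom = "GAATTCGAACCCCTTCGGATC"    # 21-mer, 5' phosphate omitted

paired = reverse_complement(bottom)
duplex_ok = paired.startswith(top)  # the top strand pairs along its full length
unpaired = len(bottom) - len(top)   # bases of the bottom strand left unpaired
print(paired, duplex_ok, unpaired)
```

Under these assumptions the 20-mer anneals along its whole length to the 21-mer, leaving the single 5' base of the phosphorylated strand unpaired.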

The sequences were amplified in a 25 ul reaction containing 0.25 pmol template
DNA, 0.125 units AmpliTaq Gold polymerase (Perkin Elmer), 3 uM primer
[5'-d(GATCCGAAGGGGTTCGAATT)-3'], 200 uM dNTPs, 15 mM tris-HCl (pH 8.0), 50
mM KCl and 1.5 mM MgCl2.

PCR was performed in an MJ Research Tetrad thermocycler using an initial 10
minute denaturation at 95 °C, 35 cycles of a 0.5 minute denaturation at 94 °C,
annealing for 0.5 minute at 57 °C and extension at 72 °C. This was followed by a final 5
minute extension at 72 °C.

The sequences were then labeled and hybridized to an array as described above.

SNP discovery - Method 4

As another alternative, oligonucleotides attached to magnetic beads may be used
for allele specific SNP enrichment and genotyping. Synthesized biotin-tagged
oligonucleotides containing sequences complementary to the regions of desired SNPs
were mixed with target DNA in a 1000:1 ratio. (Alternatively, a 10:1, 20:1, 50:1,
250:1 or any other ratio could have been chosen.)

The sample was then denatured at 95 °C for 10 minutes and allowed to reanneal by
slowly cooling to room temperature.

The sample was then bound to streptavidin-magnetic beads (Promega) by
mixing the sample and the beads and incubating at room temperature for 10 minutes.
The beads were then washed three times with 1X MES with 1 M sodium chloride (NaCl).
The beads were then resuspended in 50 ul 1X mung bean nuclease buffer and
mixed with 1 unit of mung bean nuclease. The beads were then incubated at 30 °C for
15 minutes. The mung bean nuclease was then inactivated by adding 1% SDS. The
beads were then washed three times with 1X MES with 1 M NaCl.

The beads were then resuspended in ligation mixture containing T4 ligase in
1X T4 ligase buffer and 200 fold excess adaptor I sequence
[5'-d(ATTAACCCTCACTAAAGCTGGAG)-3' and
5'-d(pCTCCAGCTTTAGTGAGGGTTAAT)-3'; Bpm I recognition sites are highlighted
in boldface] at 16 °C overnight. The ligase was then inactivated by incubation at 65 °C
for 10 minutes.

The beads were then washed three times with 1X MES with 1 M NaCl and then
resuspended in 50 ul 1X Bpm I restriction buffer. Bpm I was then added and the
beads were incubated at 37 °C for 1 hr. The enzyme was inactivated by incubation at
65 °C for 10 minutes and the supernatant solution with the sequences containing the
desired SNPs was collected.

A second set of adaptor sequences containing PCR template sequences
[5'-d(pCTATAGTGAGTCGTATT)-3' and 5'-AATACGACTCACTATAGNN-3'] and
ligase were then added to the supernatant solution and incubated at 16 °C overnight.
The ligase was then heat inactivated at 65 °C for 10 minutes.

The samples were then amplified with PCR using T3 (5'-ATTAACCCTCACTAAAG-3')
and T7 5'-d(TAATACGACTCACTATAGGG)-3'
sequencing primers (Operon) in a 50 ml reaction containing 10^ copies of each target
DNA, 1 unit AmpliTaq Gold polymerase (Perkin Elmer), 2 uM each primer, 200 uM
dNTPs, 15 mM tris-HCl (pH 8.0), 50 mM KCl and 2.5 mM MgCl2.

PCR was performed in an MJ Research Tetrad thermocycler using an initial 10
minute denaturation at 95 °C, 45 cycles of a 0.5 minute denaturation at 94 °C,
annealing for 0.5 minute at 52 °C and extension at 72 °C for 1 minute. This was followed
by a final 5 minute extension at 72 °C. The fragments were then labeled and hybridized
to an array.

Methods of Use

The present methods of sample preparation and analysis are appropriate for a
wide variety of applications. Any analysis of genomic DNA may benefit from a
reproducible method of complexity management.

As a preferred embodiment, the present procedure can be used for SNP
discovery and to genotype individuals. For example, any of the procedures described
above, alone or in combination, could be used to isolate the SNPs present in one or
more specific regions of genomic DNA. Arrays could then be designed and
manufactured on a large scale basis to interrogate only those fragments containing the
regions of interest. Thereafter, a sample from one or more individuals would be
obtained and prepared using the same techniques which were used to design the array.
Each sample can then be hybridized to a pre-designed array and the hybridization
pattern can be analyzed to determine the genotype of each individual or of a population
of individuals as a whole. Methods of use for polymorphisms can be found in, for
example, co-pending U.S. application 08/813,159. Some methods of use are briefly
discussed below.

Correlation of Polymorphisms with Phenotypic Traits

Some polymorphisms occur within a protein coding sequence and contribute to
phenotype by affecting protein structure. The effect may be neutral, beneficial or
detrimental, or both beneficial and detrimental, depending on the circumstances. For
example, a heterozygous sickle cell mutation (which involves a single nucleotide
polymorphism) confers resistance to malaria, but a homozygous sickle cell mutation is
usually lethal. Other polymorphisms occur in noncoding regions but may exert
phenotypic effects indirectly via influence on replication, transcription, and translation.
A single polymorphism may affect more than one phenotypic trait. Likewise, a single
phenotypic trait may be affected by polymorphisms in different genes. Further, some
polymorphisms predispose an individual to a distinct mutation that is causally related to
a certain phenotype.

Phenotypic traits include diseases that have known but hitherto unmapped
genetic components (e.g., agammaglobulinemia, diabetes insipidus, Lesch-Nyhan
syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease, familial
hypercholesterolemia, polycystic kidney disease, hereditary spherocytosis, von
Willebrand's disease, tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial
colonic polyposis, Ehlers-Danlos syndrome, osteogenesis imperfecta, and acute
intermittent porphyria). Phenotypic traits also include symptoms of, or susceptibility
to, multifactorial diseases of which a component is or may be genetic, such as
autoimmune diseases, inflammation, cancer, diseases of the nervous system, and
infection by pathogenic microorganisms. Some examples of autoimmune diseases
include rheumatoid arthritis, multiple sclerosis, diabetes (insulin-dependent and
non-insulin-dependent), systemic lupus erythematosus and Graves disease. Some
examples of cancers include cancers of the bladder, brain, breast, colon, esophagus,
kidney, leukemia, liver, lung, oral cavity, ovary, pancreas, prostate, skin, stomach and
uterus. Phenotypic traits also include characteristics such as longevity, appearance (e.g.,
baldness, obesity), strength, speed, endurance, fertility, and susceptibility or receptivity
to particular drugs or therapeutic treatments.

Correlation is performed for a population of individuals who have been tested
for the presence or absence of a phenotypic trait of interest and for polymorphic
marker sets. To perform such analysis, the presence or absence of a set of
polymorphisms (i.e., a polymorphic set) is determined for a set of the individuals, some
of whom exhibit a particular trait, and some of whom lack the trait. The
alleles of each polymorphism of the set are then reviewed to determine whether the
presence or absence of a particular allele is associated with the trait of interest.
Correlation can be performed by standard statistical methods such as a chi-squared test,
and statistically significant correlations between polymorphic form(s) and phenotypic
characteristics are noted. For example, it might be found that the presence of allele A1
at polymorphism A correlates with heart disease. As a further example, it might be
found that the combined presence of allele A1 at polymorphism A and allele B1 at
polymorphism B correlates with increased milk production of a farm animal. (See
Beitz et al., US 5,292,639.)
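
As an illustration of the kind of statistical test mentioned above, the following Python sketch computes a Pearson chi-squared statistic for a 2 x 2 table of trait status against allele status. It is a minimal sketch under stated assumptions: the function name and the counts are hypothetical, no continuity correction is applied, and the p-value uses the one-degree-of-freedom identity P(X > x) = erfc(sqrt(x/2)).

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom).

    a: trait present, allele present    b: trait present, allele absent
    c: trait absent,  allele present    d: trait absent,  allele absent
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 50 individuals with the trait, 50 without.
chi2, p = chi_squared_2x2(30, 20, 15, 35)
print(round(chi2, 3), round(p, 4))
```

A small p-value would be read as evidence that the allele and the trait are associated in the tested population; it does not by itself show that the allele causes the trait.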

Genetic Mapping of Phenotypic Traits

Linkage analysis is useful for mapping a genetic locus associated with a
phenotypic trait to a chromosomal position, and thereby cloning gene(s) responsible for
the trait. See Lander et al., Proc. Natl. Acad. Sci. (USA) 83, 7353-7357 (1986); Lander
et al., Proc. Natl. Acad. Sci. (USA) 84, 2363-2367 (1987); Donis-Keller et al., Cell 51,
319-337 (1987); Lander et al., Genetics 121, 185-199 (1989). Genes localized by
linkage can be cloned by a process known as directional cloning. See Wainwright,
Med. J. Australia 159, 170-174 (1993); Collins, Nature Genetics 1, 3-6 (1992) (each of
which is incorporated by reference in its entirety for all purposes).

Linkage studies are typically performed on members of a family. Available
members of the family are characterized for the presence or absence of a phenotypic
trait and for a set of polymorphic markers. The distribution of polymorphic markers in
an informative meiosis is then analyzed to determine which polymorphic markers
co-segregate with a phenotypic trait. See, e.g., Kerem et al., Science 245, 1073-1080
(1989); Monaco et al., Nature 316, 842 (1985); Yamoka et al., Neurology 40, 222-226
(1990); Rossiter et al., FASEB Journal 5, 21-27 (1991).

Disequilibrium mapping of the entire genome

Linkage disequilibrium or allelic association is the preferential association of a
particular allele or genetic marker with a specific allele or genetic marker at a nearby
chromosomal location more frequently than expected by chance for any particular
allele frequency in the population. For example, if locus X has alleles a and b, which
occur equally frequently, and linked locus Y has alleles c and d, which occur equally
frequently, one would expect the combination ac to occur with a frequency of 0.25. If
ac occurs more frequently, then alleles a and c are in linkage disequilibrium. Linkage
disequilibrium may result from natural selection of certain combinations of alleles or
because an allele has been introduced into a population too recently to have reached
equilibrium with linked alleles.
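
The arithmetic of the ac example above can be sketched in Python. The coefficient D and the r-squared measure used here are standard population-genetics quantities rather than anything defined in this specification, and the function name and the observed frequency of 0.40 are assumptions made for illustration.

```python
def linkage_disequilibrium(p_ac, p_a, p_c):
    """Return (D, r_squared) for allele a at locus X and allele c at locus Y.

    p_ac: observed frequency of the a-c combination
    p_a, p_c: frequencies of allele a and allele c
    """
    d = p_ac - p_a * p_c  # zero when the combination occurs as expected by chance
    denom = p_a * (1 - p_a) * p_c * (1 - p_c)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom

# Alleles a and c each at frequency 0.5: chance expectation for ac is 0.25.
d_chance, r2_chance = linkage_disequilibrium(0.25, 0.5, 0.5)
# Hypothetical observation: ac seen at frequency 0.40 instead.
d_obs, r2_obs = linkage_disequilibrium(0.40, 0.5, 0.5)
print(d_chance, r2_chance, round(d_obs, 2), round(r2_obs, 2))
```

A value of D above zero corresponds to the case described in the text, in which ac occurs more often than the expected 0.25.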

A marker in linkage disequilibrium can be particularly useful in detecting
susceptibility to disease (or other phenotype) notwithstanding that the marker does not
cause the disease. For example, a marker (X) that is not itself a causative element of a
disease, but which is in linkage disequilibrium with a gene (including regulatory
sequences) (Y) that is a causative element of a phenotype, can be detected to indicate
susceptibility to the disease in circumstances in which the gene Y may not have been
identified or may not be readily detectable.

Marker assisted breeding

Genetic markers can help decipher the genomes of animals and crop plants. Genetic
markers can aid a breeder in understanding, selecting and managing the genetic
complexity of an agronomic or desirable trait. The agriculture world, for example, has
a great deal of incentive to try to produce food with a rising number of desirable traits
(high yield, disease resistance, taste, smell, color, texture, etc.) as consumer demand
and expectations increase. However, many traits, even when the molecular
mechanisms are known, are too difficult or costly to monitor during production.
Readily detectable polymorphisms which are in close physical proximity to the
desired genes can be used as a proxy to determine whether the desired trait is present or
not in a particular organism. This provides for an efficient screening tool which can
accelerate the selective breeding process.



Pharmacogenomics

Genetic information can provide a powerful tool for doctors to determine what
course of medicine is best for a particular patient. A recent Science paper entitled
"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring" (to be published 10/15/99, hereby incorporated by reference in
its entirety for all purposes) discusses the use of genetic information discovered through
the use of arrays to determine the specific type of cancer a particular patient has. The
paper goes on to discuss the ways in which particular treatment options can then be
tailored for each patient's particular type of cancer. Similar uses of genetic information
for treatment plans have been disclosed for patients with HIV. (See US Patent
Application 5,861,242).

The pharmaceutical industry is likewise interested in the area of
pharmacogenomics. Every year pharmaceutical companies suffer large losses from
drugs which fail clinical trials for one reason or another. Some of the most difficult are
those drugs which, while being highly effective for a large percentage of the
population, prove dangerous or even lethal for a very small percentage of the
population. Pharmacogenomics can be used to correlate a specific genotype with
specific responses to a drug. The basic idea is to get the right drug to the right patient.
If pharmaceutical companies (and later, physicians) can accurately remove from the
potential recipient pool those patients who would suffer adverse responses to a
particular drug, many research efforts which are currently being dropped by
pharmaceutical companies could be resurrected, saving hundreds of thousands of dollars
for the companies and providing many currently unavailable medications to patients.

Similarly, some medications may be highly effective for only a very small
percentage of the population while proving only slightly effective or even ineffective for
a large percentage of patients. Pharmacogenomics allows pharmaceutical companies
to predict which patients would be the ideal candidates for a particular drug, thereby
dramatically reducing failure rates and providing greater incentive to companies to
continue to conduct research into those drugs.

Forensics

The capacity to identify a distinguishing or unique set of forensic markers in an
individual is useful for forensic analysis. For example, one can determine whether a
blood sample from a suspect matches a blood or other tissue sample from a crime scene
by determining whether the set of polymorphic forms occupying selected polymorphic
sites is the same in the suspect and the sample. If the set of polymorphic markers does
not match between a suspect and a sample, it can be concluded (barring experimental
error) that the suspect was not the source of the sample. If the set of markers does
match, one can conclude that the DNA from the suspect is consistent with that found at
the crime scene. If frequencies of the polymorphic forms at the loci tested have been
determined (e.g., by analysis of a suitable population of individuals), one can perform a
statistical analysis to determine the probability that a match of suspect and crime scene
sample would occur by chance.
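
One common way to carry out such a statistical analysis is to multiply genotype frequencies across loci. The Python sketch below shows the idea under assumptions that the specification does not state: Hardy-Weinberg proportions at each locus, independence between loci, and entirely hypothetical allele frequencies and function names.

```python
def genotype_frequency(allele1, allele2, allele_freqs):
    """Expected genotype frequency at one locus under Hardy-Weinberg proportions."""
    p = allele_freqs[allele1]
    q = allele_freqs[allele2]
    return p * p if allele1 == allele2 else 2 * p * q

def random_match_probability(profile, freqs_by_locus):
    """Probability that an unrelated individual shares the whole profile.

    Assumes the loci are independent, so per-locus frequencies multiply.
    """
    probability = 1.0
    for locus, (allele1, allele2) in profile.items():
        probability *= genotype_frequency(allele1, allele2, freqs_by_locus[locus])
    return probability

# Hypothetical allele frequencies at three SNP loci.
freqs = {
    "snp1": {"A": 0.6, "G": 0.4},
    "snp2": {"C": 0.5, "T": 0.5},
    "snp3": {"A": 0.9, "T": 0.1},
}
# Hypothetical profile shared by the suspect and the crime scene sample.
profile = {"snp1": ("A", "G"), "snp2": ("C", "C"), "snp3": ("T", "T")}
rmp = random_match_probability(profile, freqs)
print(rmp)
```

With these made-up frequencies the per-locus values are 0.48, 0.25 and 0.01; adding further loci drives the chance-match probability lower.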

Paternity Testing / Determination of Relatedness

The object of paternity testing is usually to determine whether a male is the
father of a child. In most cases, the mother of the child is known and thus, the mother's
contribution to the child's genotype can be traced. Paternity testing investigates
whether the part of the child's genotype not attributable to the mother is consistent with
that of the putative father. Paternity testing can be performed by analyzing sets of
polymorphisms in the putative father and the child. Of course, the present invention
can be expanded to the use of this procedure to determine if one individual is related to
another. Even more broadly, the present invention can be employed to determine how
related one individual is to another, for example, between races or species.
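
The consistency check described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and genotypes are hypothetical, each genotype is a pair of alleles at one locus, and mutation and genotyping error are ignored.

```python
def paternal_candidates(child, mother):
    """Alleles the child could have inherited from the father at one locus."""
    candidates = set()
    for maternal in set(child) & set(mother):
        remaining = list(child)
        remaining.remove(maternal)  # set aside the allele credited to the mother
        candidates.add(remaining[0])
    return candidates

def is_excluded(child, mother, putative_father):
    """True if, at any locus, the man carries none of the possible paternal alleles."""
    for locus in child:
        candidates = paternal_candidates(child[locus], mother[locus])
        if not candidates:
            raise ValueError("mother and child share no allele at " + locus)
        if not candidates & set(putative_father[locus]):
            return True
    return False

# Hypothetical genotypes at two SNP loci.
child = {"snp1": ("A", "G"), "snp2": ("C", "C")}
mother = {"snp1": ("A", "A"), "snp2": ("C", "T")}
man_1 = {"snp1": ("G", "G"), "snp2": ("C", "T")}
man_2 = {"snp1": ("A", "A"), "snp2": ("C", "C")}
print(is_excluded(child, mother, man_1), is_excluded(child, mother, man_2))
```

A result of False means only that the man is not excluded; as in the forensic case, allele frequencies would be needed to say how informative that consistency is.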


Conclusion

From the foregoing it can be seen that the advantage of the present invention is
that it provides a flexible and scalable method for analyzing complex samples of DNA,
such as genomic DNA. These methods are not limited to any particular type of nucleic
acid sample: plant, bacterial, animal (including human) total genome DNA, RNA,
cDNA and the like may be analyzed using some or all of the methods disclosed in this
invention. This invention provides a powerful tool for analysis of complex nucleic acid
samples. From experiment design to isolation of desired fragments and hybridization
to an appropriate array, the above invention provides for faster, more efficient and less
expensive methods of complex nucleic acid analysis.

All publications and patent applications cited above are incorporated by
reference in their entirety for all purposes to the same extent as if each individual
publication or patent application were specifically and individually indicated to be so
incorporated by reference. Although the present invention has been described in some
detail by way of illustration and example for purposes of clarity and understanding, it
will be apparent that certain changes and modifications may be practiced within the
scope of the appended claims.

EXHIBIT A



#!/internet/bin/perl5.002 -w

# Copyright (c) 1998
# Eugene Wang

# *** BEGIN ***
#
#input sequence (File 0) to compare
#

if ($#ARGV < 2) {die "argv < 2";}

open(EnzymeInput, $ARGV[0]) || die "Cannot open input file $ARGV[0]";

#print "Input Enzyme 1 sequence = ";
$E1sequence = <EnzymeInput>;
chomp $E1sequence;
$lenE1Seq = length($E1sequence);
$E1sequence =~ tr/a-z/A-Z/;

$E1ExtLoc = <EnzymeInput>;
chomp($E1ExtLoc);

$lenE1Total = $lenE1Seq + $E1ExtLoc;

#print "Input Enzyme 2 sequence = ";
$E2sequence = <EnzymeInput>;
chomp $E2sequence;

$E2sequence = reverse($E2sequence);
$lenE2Seq = length($E2sequence);
$E2sequence =~ tr/a-z/A-Z/;

$E2ExtLoc = <EnzymeInput>;
chomp($E2ExtLoc);

$lenE2Total = $lenE2Seq + $E2ExtLoc;

$lenE1Extra = $E2ExtLoc - $E1ExtLoc;

$E1SizeStart = <EnzymeInput>;
chomp($E1SizeStart);
$E1SizeEnd = <EnzymeInput>;
chomp($E1SizeEnd);

#
#open input FASTA file (File 1)
#

#print "Input file name = ";
#$fname = <>;
#chomp $fname;
#$fname = "H_DJ0167F23.seq";

open(Infile, $ARGV[1]) || die "Cannot open input file $ARGV[1]";

#
#open output file (File 2)
#

open(Outfile, ">$ARGV[2]") || die "Cannot open output file $ARGV[2]";
#open(Outfile, ">output.txt");
#print Outfile "Qualifier\tSequence";

#
#read input FASTA file
#

$line = <Infile>;    #header line
print Outfile "$line";
$linecount = 0;
$FullSeq = "";

#
#check headerline format
#

chomp $line;
@fields = split(/\|/, $line);

$ntokens = 0;
foreach (@fields) { $ntokens++; }
#$ntokens = @fields;

if ($ntokens > 3)
    {$FragmentID = $fields[3];}
else
{
    $line =~ s/^> />/;
    @fields = split(/ /, $line);
    $ntokens = 0;
    foreach (@fields) {$ntokens++;}
    if ($ntokens > 0)
        {$FragmentID = $fields[0]; $FragmentID =~ s/^>//;}
    else
        {$FragmentID = "UnknownFragment";}
}

while ($line = <Infile>)    #read in a line
{
    chomp $line;
#   print "$line\n";
    $linecount++;
    next if ($line eq "");

    if ($line =~ /^#/ || $line =~ /^>/)    ##if first char is a # or >
    {
        &CompareSeqWithEnzyme_ClassIIs();    ##compare the sequences before this line
        print Outfile "\n\n\n$line\n";
        $FullSeq = "";
        $linecount = 0;

        @fields = split(/\|/, $line);
        $FragmentID = $fields[3];
    }
    else
    {
        $FullSeq .= $line;
    }
}

#print Outfile "$FullSeq";
close(Infile);

#
#compare sequence with FASTA input
#
&CompareSeqWithEnzyme_ClassIIs();

#
#close output file
#
close(Outfile);

######################################################################
#compare sequence with FASTA input
######################################################################
sub CompareSeqWithEnzyme_ClassIIs()
{
    $lenFullSeq = length($FullSeq);
    if ($lenFullSeq <= 0) {return(0);}

    print Outfile "TotalLength: \t$lenFullSeq\n";

    print Outfile "Enzyme top strand: ";
    print Outfile "(5\'-$E1sequence";
    if ($E1ExtLoc>0) {print Outfile "(N)$E1ExtLoc";}
    print Outfile "-3\')";
    print Outfile "\n";

    print Outfile "Enzyme bottom strand: ";
    print Outfile "(5\'-";    # literal illegible in the source; "(5'-" assumed from the closing "-3')"
    if ($E2ExtLoc>0) {print Outfile "(N)$E2ExtLoc";}
    print Outfile "$E2sequence-3\')";
    print Outfile " or ";
    my $ts = reverse($E2sequence);
    print Outfile "(3\'-$ts";
    if ($E2ExtLoc>0) {print Outfile "(N)$E2ExtLoc";}
    print Outfile "-5\')";
    print Outfile "\n";

    print Outfile "Segment size: $E1SizeStart - $E1SizeEnd\n";

    $minLen = $lenE1Total < $lenE2Total ? $lenE1Total : $lenE2Total;
    $maxLen = $lenE1Total > $lenE2Total ? $lenE1Total : $lenE2Total;

    $nMatchE1 = 0;
    $nSelected = 0;
    @EnzLocLeft = ();
    @EnzLocRight = ();
    @EnzTypeLeft = ();
    @EnzTypeRight = ();

    if ($minLen > 0)
    {
#       for ($i=0; $i <= $lenFullSeq-$lenE1Seq; $i++)
        for ($i=0; $i <= $lenFullSeq-$maxLen; $i++)
        {
            if (substr($FullSeq, $i, $lenE1Seq) eq $E1sequence)
            {
#               $EnzLocLeft[$nMatchE1] = $i + $lenE1Total;    ##have to use push()
#               $EnzTypeLeft[$nMatchE1] = 1;
                push(@EnzLocLeft, $i + $lenE1Total);
                push(@EnzTypeLeft, 1);

#               print Outfile "$nMatchE1\t$i\t";
#               print Outfile "type 1\t";
#               print Outfile "$E1sequence\t";
#               print Outfile substr($FullSeq, $i, $lenE1Total);
#               print Outfile "\n";

                if ($nMatchE1 > 0)
                {
                    push(@EnzLocRight, $i + $lenE1Total - 1);
                    push(@EnzTypeRight, 1);
                }
                $nMatchE1++;
            }
#           if (substr($FullSeq, $i+$E2ExtLoc, $lenE2Seq) eq $E2sequence)
            elsif (substr($FullSeq, $i+$E2ExtLoc, $lenE2Seq) eq $E2sequence)
            {
#               $EnzLocLeft[$nMatchE1] = $i;
#               $EnzCutLeft[$nMatchE1] = 2;
                push(@EnzLocLeft, $i);
                push(@EnzTypeLeft, 2);

#               print Outfile "$nMatchE1\t$i\t";
#               print Outfile "type 2\t";
#               print Outfile "$E2sequence\t";
#               print Outfile substr($FullSeq, $i, $lenE2Total);
#               print Outfile "\n";

                if ($nMatchE1 > 0)
                {
                    push(@EnzLocRight, $i-1);
                    push(@EnzTypeRight, 2);
                }
                $nMatchE1++;
            }
        }
    }    # closing braces of the loop and of the $minLen test are missing in the source; position assumed

    if ($nMatchE1 > 0)
    {
        push(@EnzLocRight, $i-1);
        push(@EnzTypeRight, 2);
    }

    print Outfile "Number of segments: $nMatchE1\n";
    if ($nMatchE1 != ($#EnzLocRight+1)) {die("Counting error... nMatchE1 ($nMatchE1) != $#EnzLocRight");}

    print Outfile "Matched loci:\n";
    for ($i=0; $i < $nMatchE1; $i++)
    {
        print Outfile "$EnzLocLeft[$i]\t";
    }

    print Outfile "\nSegment Size:\n";
    for ($i=0; $i < $nMatchE1-1; $i++)
    {
        $tmpSegSize = $EnzLocRight[$i] - $EnzLocLeft[$i] + 1;
        if ($tmpSegSize >= $E1SizeStart && $tmpSegSize <= $E1SizeEnd)
        {
            $SegSelected[$nSelected++] = $i;
        }
        print Outfile "$tmpSegSize\t";
    }

    ## print out the Segment (E1) sequences

    print Outfile "\nSegments Selected ($nSelected): ";
    for ($i=0; $i < $nSelected; $i++)
    {
        $selSeq = $SegSelected[$i];
        $E1left = $EnzLocLeft[$selSeq];
        $E1right = $EnzLocRight[$selSeq];

        # operator illegible in the source; "+=" assumed by analogy with the else branch
        if ($lenE1Extra > 0) {$E1right += $lenE1Extra;}
        else {$E1left += $lenE1Extra;}
        $lenSelSeq = $E1right - $E1left + 1;

        # the leading ">" and the "_" and "-" separators are illegible in the source and are assumed
        $OutputHeaderLine = ">" . $FragmentID . "_" . $selSeq . "\tsize=" . $lenSelSeq;
        $OutputHeaderLine .= "\tLoci=" . $E1left . "-" . $E1right;
        $OutputHeaderLine .= "\tEnz$EnzTypeLeft[$selSeq]-Enz$EnzTypeRight[$selSeq]";

        print Outfile "\n$OutputHeaderLine";
        print "$OutputHeaderLine";

        # Segment sequence
        $SeqE1toNextE1 = substr($FullSeq, $E1left, $lenSelSeq);
        print Outfile "\n$SeqE1toNextE1\n";
        print "\n$SeqE1toNextE1\n";
    }

    return ($lenFullSeq);
}



EXHIBIT B

#!/internet/bin/perl5.002 -w

# Copyright (c) 1998
# Author: Eugene Wang
# Title: Ligate
# Purpose: Find matching segments/sequences in two files
#******************************************************

if ($#ARGV != 2) {die "Number of argv ($#ARGV+1) != 3";}

#
#input file
#

open(InfileLigate, $ARGV[0]) or die "Open error ... $ARGV[0]\n";

$locLigate = <InfileLigate>;
chomp $locLigate;
$seqLigate = <InfileLigate>;
chomp $seqLigate;

close(InfileLigate);

#
#output file
#

open(Infile, $ARGV[1]) or die "Open error ... $ARGV[1]\n";
$OutName = $ARGV[2];
open(Outfile, ">$OutName") or die("Open error ... $OutName");

$alreadyReadOne = 0;
$sequence = "";

while ($line = <Infile>)    #read in a line
{
    chomp $line;
    next if ($line eq "");

    if ($line =~ /^#/ || $line =~ /^>/)    ##if first char is a # or >
    {
        if ($alreadyReadOne == 1) {
            if (&Ligate($sequence, $locLigate, $seqLigate) == 1) {
                print Outfile "$headerLine\n";
                print Outfile "$sequence\n";
            };
            $sequence = "";
        }
        $headerLine = $line;
        $alreadyReadOne = 1;
    }
    else
    {
        $sequence .= $line;
    }
}

if ($alreadyReadOne == 1) {
    if (&Ligate($sequence, $locLigate, $seqLigate) == 1) {
        print Outfile "$headerLine\n";
        print Outfile "$sequence\n";
    };
}

close(Infile);
close(Outfile);

######################################################################
# compare sequence with Ligation Adapter sequence
######################################################################
sub Ligate()
{
    local $retcode = 0;
    local ($seq, $locLigate, $seqLigate) = @_;
    local $lenLigate = length($seqLigate);
    local $lenSeq = length($seq);

    # the adapter sequence must be present at the same offset from both ends
    if ((substr($seq, $locLigate, $lenLigate) eq $seqLigate) &&
        (substr($seq, $lenSeq-$locLigate-$lenLigate, $lenLigate) eq $seqLigate)) {
        $retcode = 1;
    }

    return $retcode;
}

What is claimed is:

1. A method of analyzing a first nucleic acid sample comprising:
providing said first nucleic acid sample;

reproducibly reducing the complexity of said first nucleic acid sample to
produce a second nucleic acid sample which may comprise a plurality of
non-identical sequences whereby said second nucleic acid sample is obtainable by:
fragmenting said first nucleic acid sample to produce fragments and
ligating adaptor sequences to said fragments;

fragmenting said first nucleic acid sample to produce fragments,
denaturing said fragments, allowing some of said fragments to reanneal to form
double stranded DNA sequences and removing said double stranded DNA
sequences;

amplification by arbitrarily primed PCR;

hybridizing said first nucleic acid sample to an oligonucleotide probe
bound to a solid support;

hybridizing said first nucleic acid sequence to a mismatch binding
protein;

providing a nucleic acid array;

hybridizing said second nucleic acid sample to said array; and
analyzing a hybridization pattern resulting from said hybridization.

2. The method of claim 1 wherein said second nucleic acid sample comprises at
least 0.5 % of said nucleic acid sample.

3. The method of claim 1 wherein said second nucleic acid sample comprises at
least 3 % of said nucleic acid sample.

4. The method of claim 1 wherein said second nucleic acid sample comprises at
least 12 % of said nucleic acid sample.

5. The method of claim 1 wherein said second nucleic acid sample comprises at
least 50 % of said nucleic acid sample.

6. The method of claim 1 wherein each of said non-identical sequences differs
from the other non-identical sequences by at least 5 nucleic acid bases. 

7. The method of claim 1 wherein each of said non-identical sequences differs 
from the other non-identical sequences by at least 10 nucleic acid bases. 

8. The method of claim 1 wherein each of said non-identical sequences differs 
from the other non-identical sequences by at least 50 nucleic acid bases. 

9. The method of claim 1 wherein each of said non-identical sequences differs 
from the other non-identical sequences by at least 1000 nucleic acid bases. 

10. The method of claim 1 wherein said NA sample is DNA. 

11. The method of claim 1 wherein said NA sample is genomic DNA.

12. The method of claim 1 wherein said first nucleic acid sample is cDNA 
derived from RNA or mRNA. 

13. The method of claim 1 further comprising the step of amplifying at least 
one of the non-identical sequences in said second nucleic acid sample. 

14. The method of claim 13 wherein said step of amplifying is performed by a 
polymerase chain reaction (PCR).

15. The method of claim 1 wherein the entire method is performed in a single
reaction vessel.

16. The method of claim 1 wherein said step of fragmenting the first nucleic
acid sample comprises digestion with at least one restriction enzyme. 

17. The method of claim 1 wherein said step of fragmenting the first nucleic 
acid sample comprises digestion with a type IIs endonuclease.

18. The method of claim 1 wherein said adaptor sequences comprise PCR 
primer template sequences. 

19. The method of claim 1 wherein said adaptor sequences comprise tag 
sequences. 

20. The method of claim 1 wherein said solid support is a magnetic bead. 

21. The method of claim 1 wherein said mismatch binding protein is bound to a
magnetic bead. 

22. The method of claim 1 wherein said method for analyzing a nucleic acid 
sample comprises determining whether the nucleic acid sample contains sequence 
variations. 

23. The method of claim 22 wherein said sequence variations are single 
nucleotide polymorphisms. 

24. The method of claim 1 wherein the step of obtaining a DNA array 
comprises: 

designing a DNA array to query DNA fragments which have been produced by
the identical procedures used to obtain said second nucleic acid sample. 

25. The method of claim 24 wherein the step of designing further requires 
predetermining the sequences contained in said second nucleic acid sample.

26. The method of claim wherein said step of predetermining the sequences
contained in said second nucleic acid sample is conducted in a computer system. 

27. The method of claim 23 wherein said second nucleic acid sample is 
obtainable by:

binding oligonucleotide probes containing a desired SNP sequence to magnetic
beads to form probe-bead complexes;

hybridizing said probe-bead complexes to said DNA sample;

exposing said hybridized DNA sample to a single strand DNA nuclease to
remove single stranded DNA thereby forming a DNA duplex;

ligating a double stranded adaptor sequence comprising a restriction enzyme
site to said DNA duplex;

digesting said DNA duplex with a restriction enzyme to release the magnetic
bead; and

isolating only those fragments containing said SNP sequence.

28. The method of claim 25 wherein said restriction enzyme is a Class IIs
endonuclease.

29. The method of claim 23 wherein said second nucleic acid sample is
obtainable by:

exposing the DNA sample to a mismatch binding protein;
employing a 3' to 5' exonuclease to remove single stranded DNA; and
employing a nuclease to remove single stranded DNA.

30. A method of screening for DNA sequence variations in an individual
comprising:

providing said first nucleic acid sample from said individual;

providing a second nucleic acid sample by reproducibly reducing the 
complexity of said first nucleic acid sample to produce a second nucleic acid 
sample which may comprise a plurality of non-identical sequences whereby said 
second nucleic acid sample is obtainable by:

fragmenting said first nucleic acid sample to produce fragments and
ligating adaptor sequences to said fragments; 

fragmenting said first nucleic acid sample to produce fragments, 
denaturing said fragments, allowing some of said fragments to reanneal to form 
double stranded DNA sequences and removing said double stranded DNA 
sequences. 

amplification by arbitrarily primed PCR; 

hybridizing said first nucleic acid sample to an oligonucleotide probe 
bound to a solid support; 

hybridizing said first nucleic acid sequence to a mismatch binding
protein;

providing a nucleic acid array; 

hybridizing said second nucleic acid sample to said array; and 
analyzing a hybridization pattern resulting from said hybridization. 

31. The method of claim 30 wherein said sequence variation is a SNP.

32. The method of claim 31 wherein said SNP is associated with a disease. 

33. The method of claim 31 wherein said SNP is associated with the efficacy of
a drug.

34. A method of screening for DNA sequence variations in a population of 
individuals comprising: 

providing a first nucleic acid sample from each of said individuals;

providing a second nucleic acid sample by reproducibly reducing the 
complexity of said first nucleic acid sample to produce a second nucleic acid 
sample which may comprise a plurality of non-identical sequences whereby said 
second nucleic acid sample is obtainable by: 

fragmenting said first nucleic acid sample to produce fragments and 
ligating adaptor sequences to said fragments;

fragmenting said first nucleic acid sample to produce fragments,
denaturing said fragments, allowing some of said fragments to reanneal to form 
double stranded DNA sequences and removing said double stranded DNA 
sequences. 

amplification by arbitrarily primed PCR; 

hybridizing said first nucleic acid sample to an oligonucleotide probe 
bound to a solid support; 

hybridizing said first nucleic acid sequence to a mismatch binding
protein;

providing a nucleic acid array; 

hybridizing said second nucleic acid sample to said array; and 
analyzing a hybridization pattern resulting from said hybridization. 

35. The method of claim 34 further comprising the step of compiling the 
analyses of each individual's hybridization pattern. 

36. The method of claim 34 wherein said sequence variation is a SNP. 

37. In a computer system, a method of designing an array comprising: 
modeling specific enzymatic reactions between a known nucleic acid sequence
and an enzyme;

obtaining the results of said modeled enzymatic reactions; 
obtaining probe sequences based upon said results; and 
designing an array to contain said probe sequences. 
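
The steps of claim 37 can be sketched in code under some loud assumptions: the modeled enzymatic reaction is taken to be a restriction digest at a simple recognition site, the results are size-selected fragments, and each probe is a fixed-length window from the middle of a retained fragment. The function names, the recognition site, the size window and the probe length are all hypothetical choices made for illustration; the claims do not specify them.

```python
from typing import List


def digest(sequence: str, site: str, cut_offset: int) -> List[str]:
    """Model a restriction digest of one strand of `sequence`.

    The enzyme is assumed to cut `cut_offset` bases into every occurrence
    of `site` (hypothetical parameters, e.g. 1 for a G^AATTC-style cut).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    # Drop empty fragments that arise when a cut falls at a sequence end.
    return [f for f in fragments if f]


def size_select(fragments: List[str], min_len: int, max_len: int) -> List[str]:
    """Keep only fragments whose length lies in [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def design_probes(fragments: List[str], probe_len: int) -> List[str]:
    """Take one probe of `probe_len` bases from the middle of each fragment."""
    probes = []
    for frag in fragments:
        if len(frag) >= probe_len:
            mid = (len(frag) - probe_len) // 2
            probes.append(frag[mid:mid + probe_len])
    return probes


# Toy "known nucleic acid sequence" (illustrative only, not real genomic data).
known_sequence = "AAAA" + "GAATTC" + "C" * 9 + "GAATTC" + "TT" + "GAATTC" + "GGGG"
modeled_fragments = digest(known_sequence, "GAATTC", 1)
retained = size_select(modeled_fragments, 8, 12)
retained_fraction = sum(len(f) for f in retained) / len(known_sequence)
array_probes = design_probes(retained, 6)
print(retained_fraction, array_probes)
```

Because the digest and size selection are deterministic for a given sequence and enzyme, the same retained fragments are predicted each time, which is what lets the probe set be chosen before any sample is prepared. The retained fraction computed here corresponds loosely to the percentages recited in claims 2 through 5.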

38. A method of analyzing a plurality of nucleic acid samples, comprising:
treating a first nucleic acid sample according to a defined procedure that
produces a first population of fragments, the collective sequences of the fragments
comprising a subset of the collective sequences present in the first nucleic acid sample,

determining abundance or composition of a subset of the first population of 
fragments; 



treating a second nucleic acid sample according to the defined procedure to 
produce a second population of fragments containing corresponding fragments to the 
fragments in the first population; 

determining abundance or composition of a subset of fragments in the second 
population having sequences corresponding to the subset of fragments in the first
population. 
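
The comparison recited in claim 38 might be sketched as follows, assuming that abundance is available as a numeric signal per fragment (for example, a hybridization intensity) and that corresponding fragments in the two populations share an identifier. The function name, the dictionary representation and the pseudocount are illustrative assumptions, not part of the claimed method.

```python
import math
from typing import Dict


def compare_abundance(first: Dict[str, float],
                      second: Dict[str, float],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Return log2(second / first) for fragments measured in both samples.

    `first` and `second` map a fragment identifier to its measured signal.
    The pseudocount (an assumption of this sketch) avoids division by zero
    when a fragment has no detectable signal in one sample.
    """
    shared = sorted(set(first) & set(second))
    return {
        frag: math.log2((second[frag] + pseudocount) / (first[frag] + pseudocount))
        for frag in shared
    }


# Hypothetical signals for two samples treated by the same defined procedure.
first_sample = {"frag_a": 99.0, "frag_b": 49.0, "frag_c": 10.0}
second_sample = {"frag_a": 99.0, "frag_b": 199.0, "frag_d": 5.0}
ratios = compare_abundance(first_sample, second_sample)
print(ratios)
```

Only fragments present in both populations are compared, mirroring the requirement that the second determination be made on fragments corresponding to the subset examined in the first population.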